Association Analysis 1


1 Association Analysis 1

2 Association Rules 1 3

3 Learning Outcomes Motivation Association Rule Mining Definition Terminology Apriori Algorithm Reducing Output: Closed and Maximal Itemsets 4

4 Overview One of the most popular problems in computer science! [Agrawal, Imielinski, Swami, SIGMOD 1993] 13th most cited across all computer science [Agrawal, Srikant, VLDB 1994] 15th most cited across all computer science 5

5 Overview Pattern Mining Unsupervised learning Useful for large datasets exploration: what is this data like? building global models Less suitable for well-studied and understood problem domains 6

6 Are humans inherently good at pattern mining? Is there a pattern? 7

7 Pattern or Illusion? 8

8 Pattern Mining Humans have always been doing pattern mining Observing & predicting nature The heavens (climate, navigation) Humans generally good at pattern recognition Illusion vs. pattern? We see what we want to see! (bias) Restricted to the natural dimensionality: 3D 9

9 Pattern vs. Chance Dog is not the pattern; the black patches are! But is that an interesting pattern? 10

10 What is a pattern? Repetitiveness Basically depends on counting Interestingness Avoid trivial patterns Chance occurrences Use statistical tests to weed these out 11

11 Motivation Frequent Item Set Mining is a method for market basket analysis. It aims at finding regularities in the shopping behavior of customers of supermarkets, mail-order companies, on-line shops etc. More specifically : Find sets of products that are frequently bought together. Possible applications of found frequent item sets: Improve arrangement of products in shelves, on a catalog's pages etc. Support cross-selling (suggestion of other products), product bundling. Fraud detection, technical dependence analysis etc. Often found patterns are expressed as association rules, for example: If a customer buys bread and wine, then she/he will probably also buy cheese. 12

8 Beer and Nappies - A Data Mining Urban Legend There is a story that a large supermarket chain, usually Wal-Mart, did an analysis of customers' buying habits and found a statistically significant correlation between purchases of beer and purchases of nappies (diapers in the US). It was theorized that fathers were stopping off at Wal-Mart to buy nappies for their babies and, since they could no longer go down to the pub as often, would buy beer as well. As a result of this finding, the supermarket chain is alleged to have placed the nappies next to the beer, resulting in increased sales. Here's what really happened: the analysis did discover that between 5:00 and 7:00 p.m. consumers bought beer and diapers, but Osco managers did NOT exploit the beer-and-diapers relationship by moving the products closer together on the shelves to increase sales of both. 13

13 Association Rule Mining Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction Market-Basket Transactions TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Example of Association Rules {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk} Implication means co-occurrence, not causality! 14

14 Definition: Frequent Itemset Itemset A collection of one or more items Example: {Milk, Bread, Diaper} k-itemset An itemset that contains k items Support count (σ) Frequency of occurrence of an itemset E.g. σ({Milk, Bread, Diaper}) = 2 Support Fraction of transactions that contain an itemset E.g. s({Milk, Bread, Diaper}) = 2/5 Frequent Itemset An itemset whose support is greater than or equal to a minsup threshold TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke 15

15 Definition: Association Rule Association Rule An implication expression of the form X → Y, where X and Y are itemsets Example: {Milk, Diaper} → {Beer} Rule Evaluation Metrics Support (s) Fraction of transactions that contain both X and Y Confidence (c) Measures how often items in Y appear in transactions that contain X TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Example: {Milk, Diaper} → {Beer}: s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4; c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67 16
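As a small illustration of these definitions (an illustrative sketch, not part of the original slides), the following Python snippet computes the support count σ(X), the support s(X), and the support/confidence of a rule over the toy transactions above; the function and variable names are just illustrative choices.

```python
# Illustrative sketch: support count, support, and rule metrics over the
# toy market-basket transactions shown in the slide above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    """s(X): fraction of transactions containing X."""
    return support_count(itemset) / len(transactions)

def rule_metrics(X, Y):
    """Support and confidence of the rule X -> Y."""
    s = support(X | Y)                              # sigma(X u Y) / |T|
    c = support_count(X | Y) / support_count(X)     # sigma(X u Y) / sigma(X)
    return s, c

print(support_count({"Milk", "Bread", "Diaper"}))   # 2
print(rule_metrics({"Milk", "Diaper"}, {"Beer"}))   # (0.4, 0.666...)
```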

16 Association Rule Mining Task Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ minsup threshold and confidence ≥ minconf threshold Brute-force approach: List all possible association rules Compute the support and confidence for each rule Prune rules that fail the minsup and minconf thresholds Computationally prohibitive! 17

17 Mining Association Rules TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Example of Rules: {Milk, Diaper} → {Beer} (s=0.4, c=0.67) {Milk, Beer} → {Diaper} (s=0.4, c=1.0) {Diaper, Beer} → {Milk} (s=0.4, c=0.67) {Beer} → {Milk, Diaper} (s=0.4, c=0.67) {Diaper} → {Milk, Beer} (s=0.4, c=0.5) {Milk} → {Diaper, Beer} (s=0.4, c=0.5) Observations: All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} Rules originating from the same itemset have identical support but can have different confidence Thus, we may decouple the support and confidence requirements 18

18 Mining Association Rules Two-step approach: 1. Frequent Itemset Generation Generate all itemsets whose support ≥ minsup 2. Rule Generation Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset Frequent itemset generation is still computationally expensive 19

19 Frequent Itemset Generation null A B C D E AB AC AD AE BC BD BE CD CE DE ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE ABCDE Given d items, there are 2^d possible candidate itemsets 20

20 Frequent Itemset Generation Each itemset in the lattice is a candidate frequent itemset Count the support of each candidate by scanning the database of N transactions TID Items 001 Bread, Milk, Apples, Juice, Chips, Soda 002 Soda, Chips 003 Bread, Milk, Butter 004 Apples, Soda, Chips, Milk, Bread 005 Cheese, Chips, Soda 006 Bread, Milk, Chips 007 Apples, Chips, Soda 008 Crackers, Apples, Butter Match each transaction (of average width W) against every one of the M candidate itemsets Complexity: O(NMW), expensive since M = 2^d − 1 21

21 Given d unique items: Total number of itemsets = 2^d Total number of possible association rules: R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1 If d = 6, R = 602 rules If d = 10, R = 57,002 rules If d = 15, R = 14,283,372 rules
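A quick way to sanity-check this count (an illustrative sketch, not from the slides): the double sum simplifies to the closed form R = 3^d − 2^(d+1) + 1, which the snippet below verifies for the values of d quoted above.

```python
# Sketch: exhaustive double sum vs the closed form for the number of rules.
from math import comb

def num_rules(d):
    # sum over antecedent-plus-consequent sizes: choose the k LHS items,
    # then any non-empty subset of the remaining d-k items as the RHS
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

for d in (6, 10, 15):
    closed_form = 3**d - 2**(d + 1) + 1
    assert num_rules(d) == closed_form
    print(d, closed_form)   # 6 -> 602, 10 -> 57002, 15 -> 14283372
```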

22 Needle in a haystack!! 23

23 Frequent Itemset Generation Strategies Reduce the number of candidates (M) Complete search: M=2 d Use pruning techniques to reduce M Reduce the number of transactions (N) Reduce size of N as the size of itemset increases Used by DHP and vertical-based mining algorithms Reduce the number of comparisons (NM) Use efficient data structures to store the candidates or transactions No need to match every candidate against every transaction 24

24 The Apriori Algorithm [Agrawal and Srikant 1994] 25

25 Reducing Number of Candidates Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent Apriori principle holds due to the following property of the support measure: ∀ X, Y : (X ⊆ Y) ⟹ s(X) ≥ s(Y) Support of an itemset never exceeds the support of its subsets This is known as the anti-monotone property of support 26

26 Illustrating Apriori Principle null A B C D E AB AC AD AE BC BD BE CD CE DE Found to be infrequent ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE Pruned supersets ABCDE 27

27 Illustrating Apriori Principle Initial database D (minsup = 0.5): TID 100: {A, C, D}; TID 200: {B, C, E}; TID 300: {A, B, C, E}; TID 400: {B, E}. Scan D and count C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3; prune {D} to get L1 = {A}, {B}, {C}, {E}. Join, scan D and count C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2; prune to get L2 = {A,C}, {B,C}, {B,E}, {C,E}. Join, scan D and count C3: {B,C,E}:2, which meets minsup, so L3 = {B,C,E}.

28 Apriori Algorithm Method: Let k=1 Generate frequent itemsets of length 1 Repeat until no new frequent itemsets are identified Generate length (k+1) candidate itemsets from length k frequent itemsets Prune candidate itemsets containing subsets of length k that are infrequent Count the support of each candidate by scanning the DB Eliminate candidates that are infrequent, leaving only those that are frequent 29
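The method above can be condensed into a short Python sketch. This is a minimal illustration of the level-wise generate-and-test idea, not the original pseudocode; function and variable names are made up for the example.

```python
# Minimal Apriori sketch: level-wise candidate generation, subset-based
# pruning, and support counting by scanning the transaction list.
from itertools import combinations

def apriori(transactions, minsup):
    n = len(transactions)
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= minsup}]
    k = 1
    while frequent[-1]:
        prev = frequent[-1]
        # join step: unions of frequent k-itemsets that differ in one item
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # prune step: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        # count step: keep only candidates that meet minsup
        frequent.append({c for c in candidates if support(c) >= minsup})
        k += 1
    return [itemset for level in frequent for itemset in level]
```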

29 Factors Affecting Complexity Choice of minimum support threshold Lowering support threshold results in more frequent itemsets This may increase number of candidates and max length of frequent itemsets Dimensionality (number of items) of the data set More space is needed to store support count of each item If number of frequent items also increases, both computation and I/O costs may also increase Size of database Since Apriori makes multiple passes, run time of algorithm may increase with number of transactions Average transaction width Transaction width increases with denser data sets This may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width) 30

30 Reducing the Output: Closed and Maximal Item Sets 32

31 Maximal Frequent Itemset An itemset is maximal frequent if none of its immediate supersets is frequent null Maximal Itemsets A B C D E AB AC AD AE BC BD BE CD CE DE ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE Infrequent Itemsets ABCD E Border 33

32 Closed Itemset An itemset is closed if none of its immediate supersets has the same support as the itemset TID Items 1 {A,B} 2 {B,C,D} 3 {A,B,C,D} 4 {A,B,D} 5 {A,B,C,D} Itemset Support {A} 4 {B} 5 {C} 3 {D} 4 {A,B} 4 {A,C} 2 {A,D} 3 {B,C} 3 {B,D} 4 {C,D} 3 Itemset Support {A,B,C} 2 {A,B,D} 3 {A,C,D} 2 {B,C,D} 3 {A,B,C,D} 2 34
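A possible way to check these definitions programmatically (an illustrative sketch; the `supports` dictionary is assumed to map frozensets to support counts for every itemset that occurs in the data, as in the table above):

```python
# Sketch: flag closed and maximal frequent itemsets from a support table.
def closed_and_maximal(supports, minsup):
    frequent = {X for X, s in supports.items() if s >= minsup}
    closed, maximal = set(), set()
    for X in frequent:
        # immediate supersets of X that occur in the data (absent ones have support 0)
        supersets = [Y for Y in supports if X < Y and len(Y) == len(X) + 1]
        if all(supports[Y] < supports[X] for Y in supersets):
            closed.add(X)          # no immediate superset has the same support
        if not any(Y in frequent for Y in supersets):
            maximal.add(X)         # no immediate superset is frequent
    return closed, maximal
```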

33 TID Items 1 ABC 2 ABCD 3 BCE 4 ACDE 5 DE Maximal vs Closed Itemsets null Transaction Ids A B C D E AB AC AD AE BC BD BE CD CE DE ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE 2 4 ABCD ABCE ABDE ACDE BCDE Not supported by any transactions ABCDE 35

34 Minimum support = 2 Maximal vs Closed Frequent Itemsets null Closed but not maximal A B C D E Closed and maximal AB AC AD AE BC BD BE CD CE DE ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE 2 4 ABCD ABCE ABDE ACDE BCDE # Closed = 9 # Maximal = 4 ABCDE 36

35 Maximal vs Closed Itemsets Frequent Itemsets Closed Frequent Itemsets Maximal Frequent Itemsets 37

36 Summary Association Rule Mining Apriori Algorithm Closed and Maximal Itemsets Next Class Alternative methods FP- Tree ECLAT 38

37 Association Rules 2 39

38 Learning Outcomes Alternative methods to Apriori (Last Lecture) FP- Tree ECLAT Rule Generation 40

39 Alternative Methods for Frequent Itemset Generation Traversal of Itemset Lattice General-to-specific vs Specific-to-general Frequent itemset border null null Frequent itemset border null {a 1,a 2,...,a n } {a 1,a 2,...,a n } Frequent itemset border {a 1,a 2,...,a n } (a) General-to-specific (b) Specific-to-general (c) Bidirectional 41

40 Alternative Methods for Frequent Itemset Generation Traversal of Itemset Lattice - Equivalent Classes null null A B C D A B C D AB AC AD BC BD CD AB AC BC AD BD CD ABC ABD ACD BCD ABC ABD ACD BCD ABCD ABCD (a) Prefix tree (b) Suffix tree 42

41 Alternative Methods for Frequent Itemset Generation Traversal of Itemset Lattice Breadth-first vs Depth-first (a) Breadth first (b) Depth first 43

42 Alternative Methods for Frequent Itemset Generation Representation of Database horizontal vs vertical data layout Horizontal Data Layout TID Items 1 A,B,E 2 B,C,D 3 C,E 4 A,C,D 5 A,B,C,D 6 A,E 7 A,B 8 A,B,C 9 A,C,D 10 B Vertical Data Layout A B C D E

43 Rule Generation 45

44 Rule Generation Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement If {A,B,C,D} is a frequent itemset, candidate rules: ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L) 46
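For illustration, the 2^k − 2 candidate rules of a frequent itemset can be enumerated as follows (a sketch; names are illustrative):

```python
# Sketch: enumerate all candidate rules f -> (L - f) from a frequent itemset L,
# skipping the empty antecedent and the empty consequent.
from itertools import combinations

def candidate_rules(L):
    L = frozenset(L)
    for r in range(1, len(L)):                 # antecedent sizes 1 .. k-1
        for antecedent in combinations(L, r):
            antecedent = frozenset(antecedent)
            yield antecedent, L - antecedent

rules = list(candidate_rules({"A", "B", "C", "D"}))
print(len(rules))   # 2**4 - 2 = 14
```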

45 Rule Generation How to efficiently generate rules from frequent itemsets? In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D) But confidence of rules generated from the same itemset has an anti-monotone property e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD) Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule 47

46 Lattice of rules Low Confidence Rule Rule Generation for Apriori Algorithm ABCD=>{ } BCD=>A ACD=>B ABD=>C ABC=>D CD=>AB BD=>AC BC=>AD AD=>BC AC=>BD AB=>CD Pruned Rules D=>ABC C=>ABD B=>ACD A=>BCD 48

47 Rule Generation for Apriori Algorithm Candidate rule is generated by merging two rules that share the same prefix in the rule consequent join(cd=>ab,bd=>ac) would produce the candidate rule D => ABC CD=>AB BD=>AC Prune rule D=>ABC if its subset AD=>BC does not have high confidence D=>ABC 49

48 Apriori Apriori: uses a generate-and-test approach generates candidate itemsets and tests if they are frequent Generation of candidate itemsets is expensive (in both space and time) Support counting is expensive Subset checking (computationally expensive) Multiple Database scans (I/O) 50

49 The FP-Growth Algorithm Frequent Pattern Growth Algorithm [Han, Pei, and Yin 2000] 51

50 FP-Growth Algorithm FP-Growth: allows frequent itemset discovery without candidate itemset generation. Two step approach: Step 1: Build a compact data structure called the FP-tree Built using 2 passes over the data-set. Step 2: Extracts frequent itemsets directly from the FP-tree Traversal through FP-Tree 52
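A compact sketch of Step 1, FP-tree construction, is shown below. It is an illustration of the two-pass idea, not the authors' original code; the class and function names are hypothetical, and the mining step (Step 2) is omitted.

```python
# Sketch: two-pass FP-tree construction. Pass 1 counts item supports;
# pass 2 inserts each transaction with its frequent items reordered by
# decreasing support, so transactions sharing items share tree prefixes.
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fp_tree(transactions, minsup_count):
    # pass 1: item support counts
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i: c for i, c in counts.items() if c >= minsup_count}
    order = sorted(frequent, key=lambda i: -frequent[i])

    root = FPNode(None, None)
    header = defaultdict(list)   # item -> list of nodes (the node-link pointers)
    # pass 2: insert each transaction, keeping only frequent items, sorted
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in frequent), key=order.index):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header
```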

51 Finding Frequent Itemsets Input: The FP-tree Output: All frequent itemsets and their support Method: Divide and Conquer: Consider all itemsets that end in: E, D, C, B, A For each possible ending item, consider the itemsets whose second-to-last item is one of the items preceding it in the ordering E.g., for E, consider all itemsets with last item D, C, B, A. This way we get all the itemsets ending in DE, CE, BE, AE Proceed recursively this way. Do this for all items.

52 Frequent itemsets All Itemsets Ε D C B A DE CE BE AE CD BD AD BC AC AB CDE BDE ADE BCE ACE ABE BCD ACD ABD AB C ACDE BCDE ABDE ABCE ABCD ABCDE

53 Frequent Itemsets All Itemsets Ε D C B A Frequent?; DE CE BE AE CD BD AD BC AC AB Frequent?; CDE BDE ADE BCE ACE ABE BCD ACD ABD AB C Frequent? ACDE BCDE ABDE ABCE ABCD ABCDE Frequent?

54 Frequent Itemsets All Itemsets Frequent? Ε D C B A DE CE BE AE CD BD AD BC AC AB Frequent? Frequent? CDE BDE ADE BCE ACE ABE BCD ACD ABD AB C ACDE BCDE ABDE ABCE ABCD ABCDE Frequent? Frequent?

55 Frequent Itemsets All Itemsets Frequent? Ε D C B A DE CE BE AE CD BD AD BC AC AB Frequent? CDE BDE ADE BCE ACE ABE BCD ACD ABD AB C Frequent? ACDE BCDE ABDE ABCE ABCD ABCDE We can generate all itemsets this way We expect the FP-tree to contain a lot less

56 Using the FP-tree to find frequent itemsets Transaction Database: TID Items 1 {A,B} 2 {B,C,D} 3 {A,C,D,E} 4 {A,D,E} 5 {A,B,C} 6 {A,B,C,D} 7 {B,C} 8 {A,B,C} 9 {A,B,D} 10 {B,C,E} (Figure: the FP-tree built from this database, rooted at null, with a header table of node-link pointers for items A, B, C, D, E; visible node counts include A:7, B:5, B:3, C:3, C:3, C:1 and three E:1 leaves.) Bottom-up traversal of the tree: first itemsets ending in E, then D, etc., each time a suffix-based class

57 Finding Frequent Itemsets Subproblem: find frequent itemsets ending in E null A:7 B:3 B:5 C:1 C:3 Header table Item A B C D E Pointer C:3 E:1 E:1 E:1 We will then see how to compute the support for the possible itemsets

58 Finding Frequent Itemsets null Ending in D A:7 B:3 B:5 C:1 C:3 Header table Item A B C D E Pointer C:3 E:1 E:1 E:1

59 Finding Frequent Itemsets null Ending in C A:7 B:3 B:5 C:1 C:3 Header table Item A B C D E Pointer C:3 E:1 E:1 E:1

60 Finding Frequent Itemsets null Ending in B A:7 B:3 B:5 C:1 C:3 Header table Item A B C D E Pointer C:3 E:1 E:1 E:1

61 Finding Frequent Itemsets Ending in Α null A:7 B:3 B:5 C:1 C:3 Header table Item A B C D E Pointer C:3 E:1 E:1 E:1

62 Algorithm For each suffix X Phase 1 Construct the prefix tree for X as shown before, and compute the support using the header table and the pointers Phase 2 If X is frequent, construct the conditional FP-tree for X in the following steps 1. Recompute support 2. Prune infrequent items 3. Prune leaves and recurse

63 Example Phase 1 construct prefix tree null Find all prefix paths that contain E A:7 B:3 B:5 C:1 C:3 Header table Item A B C D E Pointer C:3 E:1 E:1 E:1 Suffix Paths for Ε: {A,C,D,E}, {A,D,Ε}, {B,C,E}

64 Example Phase 1 construct prefix tree Find all prefix paths that contain E null A:7 B:3 C:1 C:3 E:1 E:1 E:1 Prefix Paths for Ε: {A,C,D,E}, {A,D,Ε}, {B,C,E}

65 Example Compute Support for E (minsup = 2) How? Follow the pointers while summing up the counts: 1 + 1 + 1 = 3 > 2, so E is frequent null A:7 B:3 C:1 C:3 E:1 E:1 E:1 {E} is frequent so we can now consider suffixes DE, CE, BE, AE

66 Example E is frequent so we proceed with Phase 2 Phase 2 Convert the prefix tree of E into a conditional FP-tree Two changes (1) Recompute support (2) Prune infrequent null A:7 B:3 C:3 C:1 E:1 E:1 E:1

67 Example Recompute Support null A:7 B:3 The support counts for some of the nodes include transactions that do not end in E For example in null->b->c->e we count {B, C} C:1 C:3 The support of any node is equal to the sum of the support of leaves with label E in its subtree E:1 E:1 E:1

68 Example null A:7 B:3 C:1 C:3 E:1 E:1 E:1

69 Example null A:7 B:3 C:1 C:1 E:1 E:1 E:1

70 Example null A:7 B:1 C:1 C:1 E:1 E:1 E:1

71 Example null A:7 B:1 C:1 C:1 E:1 E:1 E:1

72 Example null A:7 B:1 C:1 C:1 E:1 E:1 E:1

73 Example null A:2 B:1 C:1 C:1 E:1 E:1 E:1

74 Example null A:2 B:1 C:1 C:1 E:1 E:1 E:1

75 Example null Truncate A:2 B:1 Delete the nodes of Ε C:1 C:1 E:1 E:1 E:1

76 Example null Truncate A:2 B:1 Delete the nodes of Ε C:1 C:1 E:1 E:1 E:1

77 Example null Truncate A:2 B:1 Delete the nodes of Ε C:1 C:1

78 Example null Prune infrequent A:2 B:1 In the conditional FPtree some nodes may have support less than minsup C:1 C:1 e.g., B needs to be pruned This means that B appears with E less than minsup times

79 Example null A:2 B:1 C:1 C:1

80 Example null A:2 C:1 C:1

81 Example null A:2 C:1 C:1 The conditional FP-tree for E Repeat the algorithm for {D, E}, {C, E}, {A, E}

82 Example null A:2 C:1 C:1 Phase 1 Find all prefix paths that contain D (DE) in the conditional FP-tree

83 Example null A:2 C:1 Phase 1 Find all prefix paths that contain D (DE) in the conditional FP-tree

84 Example null A:2 C:1 Compute the support of {D,E} by following the pointers in the tree 1+1 = 2 2 = minsup {D,E} is frequent

85 Example null A:2 C:1 Phase 2 Construct the conditional FP-tree 1. Recompute Support 2. Prune nodes

86 Example null A:2 Recompute support C:1

87 Example null A:2 Prune nodes C:1

88 Example null A:2 Prune nodes C:1

89 Example null A:2 Prune nodes C:1 Small support

90 Example null A:2 Final conditional FP-tree for {D,E} The support of A is ≥ minsup, so {A,D,E} is frequent Since the tree has a single node we return to the next subproblem.

91 Example null A:2 C:1 C:1 The conditional FP-tree for E We repeat the algorithm for {D,E}, {C,E}, {A,E}

92 Example null A:2 C:1 C:1 Phase 1 Find all prefix paths that contain C (CE) in the conditional FP-tree

93 Example null A:2 C:1 C:1 Phase 1 Find all prefix paths that contain C (CE) in the conditional FP-tree

94 Example null A:2 C:1 C:1 Compute the support of {C,E} by following the pointers in the tree 1+1 = 2 2 = minsup {C,E} is frequent

95 Example null A:2 C:1 C:1 Phase 2 Construct the conditional FP-tree 1. Recompute Support 2. Prune nodes

96 Example null A:1 C:1 Recompute support C:1

97 Example null A:1 C:1 Prune nodes C:1

98 Example null A:1 Prune nodes

99 Example null A:1 Prune nodes

100 Example null Prune nodes Return to the previous subproblem

101 Example null A:2 C:1 C:1 The conditional FP-tree for E We repeat the algorithm for {D,E}, {C,E}, {A,E}

102 Example null A:2 C:1 C:1 Phase 1 Find all prefix paths that contain A (AE) in the conditional FP-tree

103 Example null A:2 Phase 1 Find all prefix paths that contain A (AE) in the conditional FP-tree

104 Example null A:2 Compute the support of {A,E} by following the pointers in the tree 2 minsup {A,E} is frequent There is no conditional FP-tree for {A,E}

105 Example So for E we have the following frequent itemsets {E}, {D,E}, {C,E}, {A,E} We proceed with D

106 Example Ending in D A:7 null B:3 B:5 C:1 C:3 Header table Item A B C D E Pointer C:3 E:1 E:1 E:1

107 Example null Phase 1 construct prefix tree Find all prefix paths that contain D A:7 B:3 Support 5 > minsup, D is frequent Phase 2 Convert prefix tree into conditional FP-tree C:3 B:5 C:1 C:3

108 Example null A:7 B:3 B:5 C:1 C:3 C:1 Recompute support

109 Example null A:7 B:3 B:2 C:1 C:3 C:1 Recompute support

110 Example null A:3 B:3 B:2 C:1 C:3 C:1 Recompute support

111 Example null A:3 B:3 B:2 C:1 C:1 C:1 Recompute support

112 Example null A:3 B:1 B:2 C:1 C:1 C:1 Recompute support

113 Example null A:3 B:1 B:2 C:1 C:1 C:1 Prune nodes

114 Example null A:3 B:1 B:2 C:1 C:1 C:1 Prune nodes

115 Example null A:3 B:1 B:2 C:1 C:1 C:1 Construct conditional FP-trees for {C,D}, {B,D}, {A,D} And so on.

116 Observations At each recursive step we solve a subproblem Construct the prefix tree Compute the new support Prune nodes Subproblems are disjoint so we never consider the same itemset twice Support computation is efficient happens together with the computation of the frequent itemsets.

117 Observations The efficiency of the algorithm depends on the compaction factor of the dataset If the tree is bushy then the algorithm does not work well, because it greatly increases the number of subproblems that need to be solved.

118 FP-Tree size The FP-Tree usually has a smaller size than the uncompressed data typically many transactions share items (and hence prefixes). Best case scenario: all transactions contain the same set of items. 1 path in the FP-tree Worst case scenario: every transaction has a unique set of items (no items in common) Size of the FP-tree is at least as large as the original data. Storage requirements for the FP-tree are higher need to store the pointers between the nodes and the counters. The size of the FP-tree depends on how the items are ordered Ordering by decreasing support is typically used but it does not always lead to the smallest tree (it's a heuristic). 120

119 Discussion Advantages of FP-Growth only 2 passes over dataset compresses dataset no candidate generation much faster than Apriori Disadvantages of FP-Growth FP-Tree may not fit in memory!! FP-Tree is expensive to build Trade-off: takes time to build, but once it is built, frequent itemsets are read off easily. Time is wasted (especially if support threshold is high), as the only pruning that can be done is on single items. support can only be calculated once the entire dataset is added to the FP-Tree. 121

120 Should the FP-Growth be used in favor of Apriori? Does FP-Growth have a better complexity than Apriori? Does FP-Growth always have a better running time than Apriori? More relevant questions What kind of data is it? What kind of results do I want? How will I analyze the resulting rules? 122

121 FP-Growth vs Apriori (Figure: run time in seconds of FP-growth and Apriori on data set T25I20D10K as the support threshold (%) varies.) 123

122 The Eclat Algorithm [Zaki, Parthasarathy, Ogihara, and Li 1997] 124

123 ECLAT For each item, store a list of transaction ids (tids) Horizontal Data Layout: TID Items 1 A,B,E 2 B,C,D 3 C,E 4 A,C,D 5 A,B,C,D 6 A,E 7 A,B 8 A,B,C 9 A,C,D 10 B Vertical Data Layout (TID-lists): A: 1,4,5,6,7,8,9 B: 1,2,5,7,8,10 C: 2,3,4,5,8,9 D: 2,4,5,9 E: 1,3,6 128

124 ECLAT Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets, e.g. tid-list(A) ∩ tid-list(B) = tid-list(AB). Traversal approaches: top-down, bottom-up and hybrid Advantage: very fast support counting Disadvantage: intermediate tid-lists may become too large for memory
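As an illustration of tid-list intersection (a sketch using the transactions from the previous slide; names are illustrative):

```python
# Sketch: Eclat-style support counting with a vertical layout; the support
# of {A,B} is the size of the intersection of the tid-lists of A and B.
def vertical_layout(transactions):
    tidlists = {}
    for tid, items in transactions.items():
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

transactions = {1: {"A", "B", "E"}, 2: {"B", "C", "D"}, 3: {"C", "E"},
                4: {"A", "C", "D"}, 5: {"A", "B", "C", "D"}, 6: {"A", "E"},
                7: {"A", "B"}, 8: {"A", "B", "C"}, 9: {"A", "C", "D"}, 10: {"B"}}
tidlists = vertical_layout(transactions)
print(tidlists["A"] & tidlists["B"])    # {1, 5, 7, 8} -> support of {A,B} is 4
```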

125 Partitioning The set of transactions may be divided into a number of disjoint subsets. Then each partition is searched for frequent itemsets. These frequent itemsets are called local frequent itemsets. How can information about local itemsets be used in finding frequent itemsets of the global set of transactions? Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB. 130

126 Partitioning Phase 1 Divide n transactions into m partitions Find the frequent itemsets in each partition Combine all local frequent itemsets to form candidate itemsets Phase 2 Find global frequent itemsets December 2008 GKGupta 131

127 Sampling A random sample (usually large enough to fit in the main memory) may be obtained from the overall set of transactions and the sample is searched for frequent itemsets. These frequent itemsets are called sample frequent itemsets. How can information about sample itemsets be used in finding frequent itemsets of the global set of transactions? 132

128 Sampling Not guaranteed to be accurate but we sacrifice accuracy for efficiency. A lower support threshold may be used for the sample to reduce the chance of missing any frequent itemsets. The actual frequencies of the sample frequent itemsets are then obtained. More than one sample could be used to improve accuracy. 133

129 Summary We looked at 2 alternative methods FP- Tree ECLAT Rule Generation Phase 134

130 Association Rule 3 135

131 Learning Outcomes Pattern Evaluation Advance Pattern Mining Techniques 136

132 Pattern Evaluation 137

133 The Knowledge Discovery Process Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad et al. 1996) Focus on the quality of discovered patterns independent of the data mining algorithm This definition is often quoted, but not very seriously taken into account

134 Interestingness of Discovered Patterns (Figure: the criteria valid (accurate), comprehensible, novel/surprising and useful, positioned by the amount of research devoted to them versus the difficulty of measuring them.) 139

135 Selecting the Most Interesting Rules in a Post-Processing Step Goal: select a subset of interesting rules out of all discovered rules, in order to show to the user just the most interesting rules User-driven, subjective approach Uses the domain knowledge, beliefs or preferences of the user Data-driven, objective approach Based on statistical properties of discovered patterns More generic, independent of the application domain and user

136 Data-driven vs user-driven rule interestingness measures (Diagram: a data-driven measure filters all discovered rules into a set of selected rules shown to the user; a user-driven measure additionally takes the user's beliefs and domain knowledge as input. Note: only the user-driven measure directly measures real user interest.)

137 Data-driven Approach Objective Measures 142

138 Data-driven approaches Two broad groups of approaches: Approach based on the statistical strength of patterns entirely data-driven, no user-driven aspect Based on the assumption that a certain kind of pattern is surprising to any user, in general But still independent from the application domain and from believes / domain knowledge of the target user So, mainly data-driven

139 Pattern Evaluation Association rule algorithms tend to produce too many rules many of them are uninteresting or redundant Redundant if {A,B,C} {D} and {A,B} {D} have same support & confidence Interestingness measures can be used to prune/rank the derived patterns In the original formulation of association rules, support & confidence are the only measures used 144

140 Application of Interestingness Measure Interestingness Measures 145

141 Computing Interestingness Measure Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table Contingency table for X → Y: rows X and ¬X, columns Y and ¬Y, with cell counts f11, f10 (row X), f01, f00 (row ¬X), row totals f1+ and f0+, column totals f+1 and f+0, and grand total |T| f11: support of X and Y f10: support of X and ¬Y f01: support of ¬X and Y f00: support of ¬X and ¬Y Used to define various measures support, confidence, lift, Gini, J-measure, etc. 146

142 Drawback of Confidence Contingency table for Tea and Coffee (100 transactions): 20 contain Tea, of which 15 also contain Coffee; 80 do not contain Tea, of which 75 contain Coffee Association Rule: Tea → Coffee Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9 Although confidence is high, the rule is misleading: P(Coffee | ¬Tea) = 0.9375

143 Statistical Independence Population of 1000 students 600 students know how to swim (S) 700 students know how to bike (B) 420 students know how to swim and bike (S,B) P(S,B) = 420/1000 = 0.42 P(S) × P(B) = 0.6 × 0.7 = 0.42 P(S,B) = P(S) × P(B) => Statistical independence P(S,B) > P(S) × P(B) => Positively correlated P(S,B) < P(S) × P(B) => Negatively correlated 148

144 Measures that take into account statistical dependence Statistical-based Measures: Lift = P(Y|X) / P(Y); Interest = P(X,Y) / (P(X) P(Y)); PS = P(X,Y) − P(X) P(Y); φ-coefficient = ( P(X,Y) − P(X) P(Y) ) / sqrt( P(X) [1 − P(X)] P(Y) [1 − P(Y)] ) 149
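These measures are easy to compute from the contingency counts f11, f10, f01, f00. The sketch below is illustrative; the Tea → Coffee counts used in the last line are the ones assumed in the confidence example above.

```python
# Sketch: lift, interest, PS and the phi-coefficient from the four
# contingency-table counts of a rule X -> Y.
from math import sqrt

def measures(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    p_x, p_y, p_xy = (f11 + f10) / n, (f11 + f01) / n, f11 / n
    lift = (p_xy / p_x) / p_y              # P(Y|X) / P(Y)
    interest = p_xy / (p_x * p_y)          # numerically equal to lift
    ps = p_xy - p_x * p_y                  # Piatetsky-Shapiro
    phi = ps / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
    return lift, interest, ps, phi

# Tea -> Coffee example with assumed counts f11=15, f10=5, f01=75, f00=5
print(measures(15, 5, 75, 5))   # lift ~= 0.83 < 1: negative association
```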

145 Example: Lift/Interest Association Rule: Tea → Coffee Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9 Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated) 150

146 Drawback of Lift & Interest Consider two contingency tables: in the first, X and Y each occur in 10% of the transactions and always together, so Lift = 0.1 / ((0.1)(0.1)) = 10; in the second, X and Y each occur in 90% of the transactions and co-occur in 90%, so Lift = 0.9 / ((0.9)(0.9)) = 1.11, even though the second association is intuitively much stronger. Statistical independence: If P(X,Y) = P(X)P(Y) => Lift = 1 151

147 There are lots of measures proposed in the literature Some measures are good for certain applications, but not for others What criteria should we use to determine whether a measure is good or bad? What about Apriori-style support-based pruning? How does it affect these measures? 152

148 Properties of A Good Measure Piatetsky-Shapiro: 3 properties a good measure M must satisfy: M(A,B) = 0 if A and B are statistically independent M(A,B) increase monotonically with P(A,B) when P(A) and P(B) remain unchanged M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged 153

149 Comparing Different Measures 10 examples of contingency tables (E1-E10), each defined by its counts f11, f10, f01, f00, are ranked using the various measures. (Table of counts and rankings omitted in the transcription.)

150 Property under Variable Permutation Swapping A and B transposes the contingency table: cells (p, q; r, s) become (p, r; q, s). Does M(A,B) = M(B,A)? Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc. Asymmetric measures: confidence, conviction, Laplace, J-measure, etc. 155

151 Property under Row/Column Scaling Grade-Gender Example (Mosteller, 1968): the same grade (high/low) by gender (male/female) table is shown twice, the second time with the gender columns scaled by different factors (e.g., 10x). Mosteller: the underlying association should be independent of the relative number of male and female students in the samples 156

152 Example: φ-coefficient The φ-coefficient is analogous to the correlation coefficient for continuous variables. For the two contingency tables shown on the slide the φ-coefficient works out to the same value, even though the tables describe quite different situations. (Table counts omitted in the transcription.) 157

153 Property under Null Addition Adding k extra transactions that contain neither A nor B changes only the (¬A, ¬B) cell: (p, q; r, s) becomes (p, q; r, s + k). Invariant measures: support, cosine, Jaccard, etc. Non-invariant measures: correlation, Gini, mutual information, odds ratio, etc. 158

154 Different Measures have Different Properties Symbol Measure Range P1 P2 P3 O1 O2 O3 O3' O4 Φ Correlation Yes Yes Yes Yes No Yes Yes No λ Lambda 0 1 Yes No No Yes No No* Yes No α Odds ratio 0 1 Yes* Yes Yes Yes Yes Yes* Yes No Q Yule's Q Yes Yes Yes Yes Yes Yes Yes No Y Yule's Y Yes Yes Yes Yes Yes Yes Yes No κ Cohen's Yes Yes Yes Yes No No Yes No M Mutual Information 0 1 Yes Yes Yes Yes No No* Yes No J J-Measure 0 1 Yes No No No No No No No G Gini Index 0 1 Yes No No No No No* Yes No s Support 0 1 No Yes No Yes No No No No c Confidence 0 1 No Yes No Yes No No No Yes L Laplace 0 1 No Yes No Yes No No No No V Conviction No Yes No Yes** No No Yes No I Interest 0 1 Yes* Yes Yes Yes No No No No IS IS (cosine) No Yes Yes Yes No No No Yes PS Piatetsky-Shapiro's Yes Yes Yes Yes No Yes Yes No F Certainty factor Yes Yes Yes No No No Yes No AV Added value Yes Yes Yes No No No No No S Collective strength 0 1 No Yes Yes Yes No Yes* Yes No ζ Jaccard No Yes Yes Yes No No No Yes K Klosgen's Yes Yes Yes No No No No No

155 Support-based Pruning Most of the association rule mining algorithms use support measure to prune rules and itemsets Study effect of support pruning on correlation of itemsets Generate random contingency tables Compute support and pairwise correlation for each table Apply support-based pruning and examine the tables that are removed 160

156 Effect of Support-based Pruning (Figure: histogram of the correlation values of all item pairs in the randomly generated contingency tables.)

157 Effect of Support-based Pruning (Figure: correlation histograms for the item pairs remaining under increasingly strict support thresholds, e.g. support < 0.01.) Support-based pruning eliminates mostly negatively correlated itemsets 162

158 Effect of Support-based Pruning Investigate how support-based pruning affects other measures Steps: Generate contingency tables Rank each table according to the different measures Compute the pair-wise correlation between the measures 163

159 Effect of Support-based Pruning Without support pruning (all pairs): 40.14% of measure pairs have correlation > 0.85 (Figure: matrix of pairwise correlations between the measures Conviction, Odds ratio, Collective Strength, Correlation, Interest, PS, CF, Yule Y, Reliability, Kappa, Klosgen, Yule Q, Confidence, Laplace, IS, Support, Jaccard, Lambda, Gini, J-measure, Mutual Info, where red cells indicate correlation > 0.85, plus a scatter plot of Correlation vs the Jaccard measure.) 164

160 Effect of Support-based Pruning With 0.5% <= support <= 50%: 61.45% of measure pairs have correlation > 0.85 (Figure: pairwise-correlation matrix of the same measures and a scatter plot of Correlation vs the Jaccard measure.) 165

161 Effect of Support-based Pruning With 0.5% <= support <= 30%: 76.42% of measure pairs have correlation > 0.85 (Figure: pairwise-correlation matrix of the same measures and a scatter plot of Correlation vs the Jaccard measure.) 166

162 Rule interestingness measures based on the statistical strength of the patterns More than 50 measures have been called rule interestingness measures in the literature The vast majority of these measures are based only on statistical strength or mathematical properties (Hilderman & Hamilton 2001), (Tan et al. 2002) Are they correlated with real human interest??

163 Evaluating the correlation between data-driven measures and real human interest 3-step Methodology: Rank discovered rules according to each of a number of datadriven rule interestingness measures Show (a subset of) discovered rules to the user, who evaluates how interesting each rule is Measure the linear correlation between the ranking of each datadriven measure and real human interest Rules are evaluated by an expert user, who also provided the dataset for the data miner

164 The Study (Carvalho et al. 2005) 8 datasets, one user for each dataset 11 data-driven rule interestingness measures Each data-driven measure was evaluated over 72 <user, rule> pairs (8 users × 9 rules/dataset) 88 correlation values (8 users × 11 measures) 31 correlation values (35%) were ≥ 0.6 The correlations associated with each measure varied considerably across the 8 datasets/users

165 Data-driven approaches based on finding specific kinds of surprising patterns Principle: a few special kinds of patterns tend to be surprising to most users, regardless of the meaning of the attributes in the pattern and the application domain Does not require domain knowledge or beliefs specified by the user We discuss two approaches: Discovery of exception rules contradicting general rules Detection of Simpson's paradox

166 Selecting Surprising Rules Which Are Exceptions of a General Rule Data-driven approach: consider the following 2 rules R 1 : IF Cond 1 THEN Class 1 (general rule) R 2 : IF Cond 1 AND Cond 2 THEN Class 2 (exception) Cond 1, Cond 2 are conjunctions of conditions Rule R 2 is a specialization of rule R 1 Rules R 1 and R 2 predict different classes An exception rule is interesting if both the rule and its general rule have a high predictive accuracy (Suzuki & Kodratoff 1998), (Suzuki 2004)

167 Selecting Surprising Exception Rules an Example Mining data about car accidents (Suzuki 2004) IF (used_seat_belt = yes) THEN (injury = no) (general rule) IF (used_seat_belt = yes) AND (passenger = child) THEN (injury = yes) (exception rule) If both rules are accurate, then the exception rule would be considered interesting (it is both accurate and surprising)

168 Simpson Paradox What's the confidence of the following rules: (rule 1) {HDTV=Yes} → {Exercise machine=Yes} (rule 2) {HDTV=No} → {Exercise machine=Yes}? Confidence of rule 1 = 99/180 = 55% Confidence of rule 2 = 54/120 = 45% So, customers who buy high-definition televisions are more likely to buy exercise machines than those who don't buy high-definition televisions. Right? Well, maybe not

169 Stratification: Simpson paradox Consider the same data broken down by customer group: What's the confidence of the rules for each stratum: (rule 1) {HDTV=Yes} → {Exercise machine=Yes} (rule 2) {HDTV=No} → {Exercise machine=Yes}? College students: Confidence of rule 1 = 1/10 = 10% Confidence of rule 2 = 4/34 = 11.8% Working Adults: Confidence of rule 1 = 98/170 = 57.7% Confidence of rule 2 = 50/86 = 58.1% The rules suggest that, for each group, customers who don't buy HDTV are more likely to buy exercise machines, which contradicts the previous conclusion drawn when data from the two customer groups are pooled together.
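The pooled versus stratified confidences above can be reproduced with a few lines (an illustrative sketch using the counts quoted on this and the previous slide):

```python
# Sketch: Simpson's paradox, pooled vs stratified. Each entry is
# (buys exercise machine, total) for the given HDTV condition.
strata = {
    "college students": {"HDTV=Yes": (1, 10),   "HDTV=No": (4, 34)},
    "working adults":   {"HDTV=Yes": (98, 170), "HDTV=No": (50, 86)},
}

for group, rules in strata.items():
    for cond, (num, den) in rules.items():
        print(group, cond, "->", round(num / den, 3))   # HDTV=No wins in each group

# pooled confidences reverse the direction of the effect
pooled_yes = (1 + 98) / (10 + 170)    # 99/180 = 0.55
pooled_no = (4 + 50) / (34 + 86)      # 54/120 = 0.45
print(pooled_yes, pooled_no)
```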

170 Importance of Stratification The lesson here is that proper stratification is needed to avoid generating spurious patterns resulting from Simpson's paradox. For example Market basket data from a major supermarket chain should be stratified according to store locations, while Medical records from various patients should be stratified according to confounding factors such as age and gender.

171 Effect of Support Distribution Many real data sets have skewed support distribution where most of the items have relatively low to moderate frequencies, but a small number of them have very high frequencies.

172 Skewed distribution Tricky to choose the right support threshold for mining such data sets. If we set the threshold too high (e.g., 20%), then we may miss many interesting patterns involving the low support items from G1. Such low support items may correspond to expensive products (such as jewelry) that are seldom bought by customers, but whose patterns are still interesting to retailers. Conversely, when the threshold is set too low, there is the risk of generating spurious patterns that relate a high frequency item such as milk to a low frequency item such as caviar.

173 Cross-support patterns Cross-support patterns are those that relate a high frequency item such as milk to a low frequency item such as caviar. Likely to be spurious because their correlations tend to be weak. E.g. the confidence of {caviar} → {milk} is likely to be high, but the pattern is still spurious, since there is probably no real correlation between caviar and milk. However, we don't want to use Lift during the computation of frequent itemsets because it doesn't have the anti-monotone property; Lift is rather used as a post-processing step. So, we want to detect cross-support patterns by looking at some anti-monotone property.

174 Cross-support patterns Definition A cross-support pattern is an itemset X = {i1, i2, ..., ik} whose support ratio r(X) = min{ s(i1), s(i2), ..., s(ik) } / max{ s(i1), s(i2), ..., s(ik) } is less than a user-specified threshold hc. Example Suppose the support for milk is 70%, while the support for sugar is 10% and caviar is 0.04% Given hc = 0.01, the frequent itemset {milk, sugar, caviar} is a cross-support pattern because its support ratio is r = min{0.7, 0.1, 0.0004} / max{0.7, 0.1, 0.0004} = 0.0004 / 0.7 = 0.00058 < 0.01
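A minimal sketch of the support-ratio test (illustrative; the item supports are the milk/sugar/caviar values from the example above):

```python
# Sketch: support ratio r(X) and the cross-support test against a threshold h_c.
def support_ratio(item_supports):
    return min(item_supports) / max(item_supports)

def is_cross_support(item_supports, h_c):
    return support_ratio(item_supports) < h_c

print(support_ratio([0.7, 0.1, 0.0004]))            # ~0.00057
print(is_cross_support([0.7, 0.1, 0.0004], 0.01))   # True -> cross-support pattern
```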

175 Detecting cross-support patterns E.g. assuming that hc = 0.3, the itemsets {p,q}, {p,r}, and {p,q,r} are cross-support patterns, because their support ratios, equal to 0.2, are less than the threshold hc. We can apply a high support threshold, say 20%, to eliminate the cross-support patterns, but this may come at the expense of discarding other interesting patterns such as the strongly correlated itemset {q,r}, which has support equal to 16.7%.

176 Detecting cross-support patterns Confidence pruning also doesn't help. Confidence for {q} → {p} is 80% even though {p, q} is a cross-support pattern. Meanwhile, rule {q} → {r} also has high confidence even though {q, r} is not a cross-support pattern. This demonstrates the difficulty of using the confidence measure to distinguish between rules extracted from cross-support and non-cross-support patterns. 181

177 Lowest confidence rule Notice that the rule {p} {q} has very low confidence because most of the transactions that contain p do not contain q. This observation suggests that: Cross-support patterns can be detected by examining the lowest confidence rule that can be extracted from a given itemset.

178 Finding lowest confidence Recall the anti-monotone property of confidence: conf( {i1, i2} → {i3, i4, ..., ik} ) ≤ conf( {i1, i2, i3} → {i4, ..., ik} ) This property suggests that confidence never increases as we shift more items from the left to the right-hand side of an association rule. Hence, the lowest confidence rule that can be extracted from a frequent itemset contains only one item on its left-hand side.

179 Finding lowest confidence Given a frequent itemset {i1, i2, ..., ik}, the rule {ij} → {i1, i2, ..., ij-1, ij+1, ..., ik} has the lowest confidence if s(ij) = max{ s(i1), s(i2), ..., s(ik) } This follows directly from the definition of confidence as the ratio between the rule's support and the support of the rule antecedent.

180 Finding lowest confidence Summarizing, the lowest confidence attainable from a frequent itemset {i1, i2, ..., ik} is s({i1, i2, ..., ik}) / max{ s(i1), s(i2), ..., s(ik) } This is also known as the h-confidence measure or all-confidence measure.

181 h-confidence Clearly, cross-support patterns can be eliminated by ensuring that the h-confidence values for the patterns exceed some threshold hc. Observe that the measure is also anti-monotone, i.e., h-confidence({i1, i2, ..., ik}) ≥ h-confidence({i1, i2, ..., ik+1}), and thus it can be incorporated directly into the mining algorithm.

182 User Driven Approach Subjective Approach 187

183 Subjective Interestingness Measure Objective measure: Rank patterns based on statistics computed from data Subjective measure: Rank patterns according to user s interpretation A pattern is subjectively interesting if it contradicts the expectation of a user (Silberschatz & Tuzhilin) A pattern is subjectively interesting if it is actionable (Silberschatz & Tuzhilin) 188

184 Interestingness via Unexpectedness Need to model the expectation of users (domain knowledge): a pattern may be expected to be frequent (+) or expected to be infrequent (-), and found in the data to be frequent or infrequent; when expectation and evidence agree the pattern is expected, when they disagree it is unexpected Need to combine the expectation of users with evidence from the data (i.e., extracted patterns) 189

185 Selecting Interesting Association Rules based on User-Defined Templates User-driven approach: the user specifies templates for interesting and non-interesting rules Inclusive templates specify interesting rules Restrictive templates specify non-interesting rules A rule is considered interesting if it matches at least one inclusive template and it matches no restrictive template (Klemettinen et al. 1994)

186 Selecting Interesting Association Rules Hierarchies of items and their categories food drink clothes rice fish hot dog coke wine water trousers socks Inclusive template: IF (food) THEN (drink) Restrictive template: IF (hot dog) THEN (coke) Rule IF (fish) THEN (wine) is interesting

187 Selecting Surprising Rules based on General Impressions User-driven approach: the user specifies her/his general impressions about the application domain (Liu et al. 1997, 2000); (Romao et al. 2004) Example: IF (salary = high) THEN (credit = good) (different from reasonably precise knowledge, e.g. IF salary > 35,500 ) Discovered rules are matched with the user s general impressions in a post-processing step, in order to determine the most surprising (unexpected) rules

188 Selecting Surprising Rules based on General Impressions Case I rule with unexpected consequent A rule and a general impression have similar antecedents but different consequents Case II rule with unexpected antecedent A rule and a general impression have similar consequents but the rule antecedent is different from the antecedent of all general impressions In both cases the rule is considered unexpected

189 Selecting Surprising Rules based on General Impressions GENERAL IMPRESSION: IF (salary = high) AND (education_level = high) THEN (credit = good) UNEXPECTED RULES: IF (salary > 30,000) AND (education BSc) AND (mortage = yes) THEN (credit = bad) /* Case I - unexpected consequent */ IF (mortgage = no) AND (C/A_balance > 10,000) THEN (credit = good) /* Case II unexpected antecedent */ /* assuming no other general impression has a similar antecedent */

190 Summary Association Rule Induction is a Two Step Process Find the frequent item sets (minimum support). Form the relevant association rules (minimum confidence). Generating the Association Rules Form all possible association rules from the frequent item sets. Filter interesting" association rules 195

191 Readings Introduction to Data Mining Pang-Ning Tan, Michael Steinbach, Vipin Kumar - Chapter 6 R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT 96. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94. H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94. J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD 00. M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94. S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97. E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE 03. Charu C. Aggarwal and Jiawei Han Frequent Pattern Mining. Springer Publishing Company, Incorporated. 196

COMP 5331: Knowledge Discovery and Data Mining

COMP 5331: Knowledge Discovery and Data Mining COMP 5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Jiawei Han, Micheline Kamber, and Jian Pei And slides provide by Raymond

More information


More information

Associa'on Rule Mining

Associa'on Rule Mining Associa'on Rule Mining Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata August 4 and 7, 2014 1 Market Basket Analysis Scenario: customers shopping at a supermarket Transaction

More information

Frequent Itemset Mining

Frequent Itemset Mining ì 1 Frequent Itemset Mining Nadjib LAZAAR LIRMM- UM COCONUT Team (PART I) IMAGINA 17/18 Webpage: http://www.lirmm.fr/~lazaar/teaching.html Email: lazaar@lirmm.fr 2 Data Mining ì Data Mining (DM) or Knowledge

More information

CSE4334/5334 Data Mining Association Rule Mining. Chengkai Li University of Texas at Arlington Fall 2017

CSE4334/5334 Data Mining Association Rule Mining. Chengkai Li University of Texas at Arlington Fall 2017 CSE4334/5334 Data Mining Assciatin Rule Mining Chengkai Li University f Texas at Arlingtn Fall 27 Assciatin Rule Mining Given a set f transactins, find rules that will predict the ccurrence f an item based

More information

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05 Meelis Kull meelis.kull@ut.ee Autumn 2017 1 Sample vs population Example task with red and black cards Statistical terminology Permutation test and hypergeometric test Histogram on a sample vs population

More information

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit:

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit: Frequent Itemsets and Association Rule Mining Vinay Setty vinay.j.setty@uis.no Slides credit: http://www.mmds.org/ Association Rule Discovery Supermarket shelf management Market-basket model: Goal: Identify

More information

Machine Learning: Pattern Mining

Machine Learning: Pattern Mining Machine Learning: Pattern Mining Information Systems and Machine Learning Lab (ISMLL) University of Hildesheim Wintersemester 2007 / 2008 Pattern Mining Overview Itemsets Task Naive Algorithm Apriori Algorithm

More information

Association Rule Mining on Web

Association Rule Mining on Web Association Rule Mining on Web What Is Association Rule Mining? Association rule mining: Finding interesting relationships among items (or objects, events) in a given data set. Example: Basket data analysis

More information

COMP 5331: Knowledge Discovery and Data Mining

COMP 5331: Knowledge Discovery and Data Mining COMP 5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Tan, Steinbach, Kumar And Jiawei Han, Micheline Kamber, and Jian Pei 1 10

More information

Temporal Data Mining

Temporal Data Mining Temporal Data Mining Christian Moewes cmoewes@ovgu.de Otto-von-Guericke University of Magdeburg Faculty of Computer Science Department of Knowledge Processing and Language Engineering Zittau Fuzzy Colloquium

More information

Mining Infrequent Patter ns

Mining Infrequent Patter ns Mining Infrequent Patter ns JOHAN BJARNLE (JOHBJ551) PETER ZHU (PETZH912) LINKÖPING UNIVERSITY, 2009 TNM033 DATA MINING Contents 1 Introduction... 2 2 Techniques... 3 2.1 Negative Patterns... 3 2.2 Negative

More information

Association Analysis. Part 1

Association Analysis. Part 1 Association Analysis Part 1 1 Market-basket analysis DATA: A large set of items: e.g., products sold in a supermarket A large set of baskets: e.g., each basket represents what a customer bought in one

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #12: Frequent Itemsets Seoul National University 1 In This Lecture Motivation of association rule mining Important concepts of association rules Naïve approaches for

More information

Frequent Itemset Mining

Frequent Itemset Mining 1 Frequent Itemset Mining Nadjib LAZAAR LIRMM- UM IMAGINA 15/16 2 Frequent Itemset Mining: Motivations Frequent Itemset Mining is a method for market basket analysis. It aims at finding regulariges in

More information

Course Content. Association Rules Outline. Chapter 6 Objectives. Chapter 6: Mining Association Rules. Dr. Osmar R. Zaïane. University of Alberta 4

Course Content. Association Rules Outline. Chapter 6 Objectives. Chapter 6: Mining Association Rules. Dr. Osmar R. Zaïane. University of Alberta 4 Principles of Knowledge Discovery in Data Fall 2004 Chapter 6: Mining Association Rules Dr. Osmar R. Zaïane University of Alberta Course Content Introduction to Data Mining Data warehousing and OLAP Data

More information

Frequent Itemset Mining

Frequent Itemset Mining ì 1 Frequent Itemset Mining Nadjib LAZAAR LIRMM- UM COCONUT Team IMAGINA 16/17 Webpage: h;p://www.lirmm.fr/~lazaar/teaching.html Email: lazaar@lirmm.fr 2 Data Mining ì Data Mining (DM) or Knowledge Discovery

More information

Knowledge Discovery and Data Mining I

Knowledge Discovery and Data Mining I Ludwig-Maximilians-Universität München Lehrstuhl für Datenbanksysteme und Data Mining Prof. Dr. Thomas Seidl Knowledge Discovery and Data Mining I Winter Semester 2018/19 Agenda 1. Introduction 2. Basics

More information

Data Analytics Beyond OLAP. Prof. Yanlei Diao

Data Analytics Beyond OLAP. Prof. Yanlei Diao Data Analytics Beyond OLAP Prof. Yanlei Diao OPERATIONAL DBs DB 1 DB 2 DB 3 EXTRACT TRANSFORM LOAD (ETL) METADATA STORE DATA WAREHOUSE SUPPORTS OLAP DATA MINING INTERACTIVE DATA EXPLORATION Overview of

More information

Unit II Association Rules

Unit II Association Rules Unit II Association Rules Basic Concepts Frequent Pattern Analysis Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set Frequent Itemset

More information

Basic Data Structures and Algorithms for Data Profiling Felix Naumann

Basic Data Structures and Algorithms for Data Profiling Felix Naumann Basic Data Structures and Algorithms for 8.5.2017 Overview 1. The lattice 2. Apriori lattice traversal 3. Position List Indices 4. Bloom filters Slides with Thorsten Papenbrock 2 Definitions Lattice Partially

More information

FP-growth and PrefixSpan

FP-growth and PrefixSpan FP-growth and PrefixSpan n Challenges of Frequent Pattern Mining n Improving Apriori n Fp-growth n Fp-tree n Mining frequent patterns with FP-tree n PrefixSpan Challenges of Frequent Pattern Mining n Challenges

More information

Outline. Fast Algorithms for Mining Association Rules. Applications of Data Mining. Data Mining. Association Rule. Discussion

Outline. Fast Algorithms for Mining Association Rules. Applications of Data Mining. Data Mining. Association Rule. Discussion Outline Fast Algorithms for Mining Association Rules Rakesh Agrawal Ramakrishnan Srikant Introduction Algorithm Apriori Algorithm AprioriTid Comparison of Algorithms Conclusion Presenter: Dan Li Discussion:

More information

A New Concise and Lossless Representation of Frequent Itemsets Using Generators and A Positive Border

A New Concise and Lossless Representation of Frequent Itemsets Using Generators and A Positive Border A New Concise and Lossless Representation of Frequent Itemsets Using Generators and A Positive Border Guimei Liu a,b Jinyan Li a Limsoon Wong b a Institute for Infocomm Research, Singapore b School of

More information

Frequent Itemset Mining

Frequent Itemset Mining Frequent Itemset Mining prof. dr Arno Siebes Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht Battling Size The previous time we saw that Big Data has

More information

Selecting the Right Interestingness Measure for Association Patterns

Selecting the Right Interestingness Measure for Association Patterns Selecting the Right ingness Measure for Association Patterns Pang-Ning Tan Department of Computer Science and Engineering University of Minnesota 2 Union Street SE Minneapolis, MN 55455 ptan@csumnedu Vipin

More information

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval Sargur Srihari University at Buffalo The State University of New York 1 A Priori Algorithm for Association Rule Learning Association

More information

Apriori algorithm. Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK. Presentation Lauri Lahti

Apriori algorithm. Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK. Presentation Lauri Lahti Apriori algorithm Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK Presentation 12.3.2008 Lauri Lahti Association rules Techniques for data mining and knowledge discovery in databases

More information

The Market-Basket Model. Association Rules. Example. Support. Applications --- (1) Applications --- (2)

The Market-Basket Model. Association Rules. Example. Support. Applications --- (1) Applications --- (2) The Market-Basket Model Association Rules Market Baskets Frequent sets A-priori Algorithm A large set of items, e.g., things sold in a supermarket. A large set of baskets, each of which is a small set

More information

Positive Borders or Negative Borders: How to Make Lossless Generator Based Representations Concise

Positive Borders or Negative Borders: How to Make Lossless Generator Based Representations Concise Positive Borders or Negative Borders: How to Make Lossless Generator Based Representations Concise Guimei Liu 1,2 Jinyan Li 1 Limsoon Wong 2 Wynne Hsu 2 1 Institute for Infocomm Research, Singapore 2 School

More information

Mining Molecular Fragments: Finding Relevant Substructures of Molecules

Mining Molecular Fragments: Finding Relevant Substructures of Molecules Mining Molecular Fragments: Finding Relevant Substructures of Molecules Christian Borgelt, Michael R. Berthold Proc. IEEE International Conference on Data Mining, 2002. ICDM 2002. Lecturers: Carlo Cagli

More information

Mining Positive and Negative Fuzzy Association Rules

Mining Positive and Negative Fuzzy Association Rules Mining Positive and Negative Fuzzy Association Rules Peng Yan 1, Guoqing Chen 1, Chris Cornelis 2, Martine De Cock 2, and Etienne Kerre 2 1 School of Economics and Management, Tsinghua University, Beijing

More information

Exercise 1. min_sup = 0.3. Items support Type a 0.5 C b 0.7 C c 0.5 C d 0.9 C e 0.6 F

Exercise 1. min_sup = 0.3. Items support Type a 0.5 C b 0.7 C c 0.5 C d 0.9 C e 0.6 F Exercise 1 min_sup = 0.3 Items support Type a 0.5 C b 0.7 C c 0.5 C d 0.9 C e 0.6 F Items support Type ab 0.3 M ac 0.2 I ad 0.4 F ae 0.4 F bc 0.3 M bd 0.6 C be 0.4 F cd 0.4 C ce 0.2 I de 0.6 C Items support

More information

Mining Class-Dependent Rules Using the Concept of Generalization/Specialization Hierarchies

Mining Class-Dependent Rules Using the Concept of Generalization/Specialization Hierarchies Mining Class-Dependent Rules Using the Concept of Generalization/Specialization Hierarchies Juliano Brito da Justa Neves 1 Marina Teresa Pires Vieira {juliano,marina}@dc.ufscar.br Computer Science Department

More information

OPPA European Social Fund Prague & EU: We invest in your future.

OPPA European Social Fund Prague & EU: We invest in your future. OPPA European Social Fund Prague & EU: We invest in your future. Frequent itemsets, association rules Jiří Kléma Department of Cybernetics, Czech Technical University in Prague http://ida.felk.cvut.cz

More information

Modified Entropy Measure for Detection of Association Rules Under Simpson's Paradox Context.

Modified Entropy Measure for Detection of Association Rules Under Simpson's Paradox Context. Modified Entropy Measure for Detection of Association Rules Under Simpson's Paradox Context. Murphy Choy Cally Claire Ong Michelle Cheong Abstract The rapid explosion in retail data calls for more effective

More information

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany Syllabus Fri. 21.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 27.10. (2) A.1 Linear Regression Fri. 3.11. (3) A.2 Linear Classification Fri. 10.11. (4) A.3 Regularization

More information

Mining Rank Data. Sascha Henzgen and Eyke Hüllermeier. Department of Computer Science University of Paderborn, Germany

Mining Rank Data. Sascha Henzgen and Eyke Hüllermeier. Department of Computer Science University of Paderborn, Germany Mining Rank Data Sascha Henzgen and Eyke Hüllermeier Department of Computer Science University of Paderborn, Germany {sascha.henzgen,eyke}@upb.de Abstract. This paper addresses the problem of mining rank

More information

Frequent Pattern Mining: Exercises

Frequent Pattern Mining: Exercises Frequent Pattern Mining: Exercises Christian Borgelt School of Computer Science tto-von-guericke-university of Magdeburg Universitätsplatz 2, 39106 Magdeburg, Germany christian@borgelt.net http://www.borgelt.net/

More information

Density-Based Clustering

Density-Based Clustering Density-Based Clustering idea: Clusters are dense regions in feature space F. density: objects volume ε here: volume: ε-neighborhood for object o w.r.t. distance measure dist(x,y) dense region: ε-neighborhood

More information

Handling a Concept Hierarchy

Handling a Concept Hierarchy Food Electronics Handling a Concept Hierarchy Bread Milk Computers Home Wheat White Skim 2% Desktop Laptop Accessory TV DVD Foremost Kemps Printer Scanner Data Mining: Association Rules 5 Why should we

More information

sor exam What Is Association Rule Mining?

sor exam What Is Association Rule Mining? Mining Association Rules Mining Association Rules What is Association rule mining Apriori Algorithm Measures of rule interestingness Advanced Techniques 1 2 What Is Association Rule Mining? i Association

More information

Sequential Pattern Mining

Sequential Pattern Mining Sequential Pattern Mining Lecture Notes for Chapter 7 Introduction to Data Mining Tan, Steinbach, Kumar From itemsets to sequences Frequent itemsets and association rules focus on transactions and the

More information

Association Analysis. Part 2

Association Analysis. Part 2 Association Analysis Part 2 1 Limitations of the Support/Confidence framework 1 Redundancy: many of the returned patterns may refer to the same piece of information 2 Difficult control of output size:

More information

Constraint-Based Rule Mining in Large, Dense Databases

Constraint-Based Rule Mining in Large, Dense Databases Appears in Proc of the 5th Int l Conf on Data Engineering, 88-97, 999 Constraint-Based Rule Mining in Large, Dense Databases Roberto J Bayardo Jr IBM Almaden Research Center bayardo@alummitedu Rakesh Agrawal

More information

Mining Strong Positive and Negative Sequential Patterns

Mining Strong Positive and Negative Sequential Patterns Mining Strong Positive and Negative Sequential Patter NANCY P. LIN, HUNG-JEN CHEN, WEI-HUA HAO, HAO-EN CHUEH, CHUNG-I CHANG Department of Computer Science and Information Engineering Tamang University,

More information

Interestingness Measures for Data Mining: A Survey

Interestingness Measures for Data Mining: A Survey Interestingness Measures for Data Mining: A Survey LIQIANG GENG AND HOWARD J. HAMILTON University of Regina Interestingness measures play an important role in data mining, regardless of the kind of patterns

More information

Distributed Mining of Frequent Closed Itemsets: Some Preliminary Results

Distributed Mining of Frequent Closed Itemsets: Some Preliminary Results Distributed Mining of Frequent Closed Itemsets: Some Preliminary Results Claudio Lucchese Ca Foscari University of Venice clucches@dsi.unive.it Raffaele Perego ISTI-CNR of Pisa perego@isti.cnr.it Salvatore

More information

On Minimal Infrequent Itemset Mining

On Minimal Infrequent Itemset Mining On Minimal Infrequent Itemset Mining David J. Haglin and Anna M. Manning Abstract A new algorithm for minimal infrequent itemset mining is presented. Potential applications of finding infrequent itemsets

More information

10/19/2017 MIST.6060 Business Intelligence and Data Mining 1. Association Rules

10/19/2017 MIST.6060 Business Intelligence and Data Mining 1. Association Rules 10/19/2017 MIST6060 Business Intelligence and Data Mining 1 Examples of Association Rules Association Rules Sixty percent of customers who buy sheets and pillowcases order a comforter next, followed by

More information

Approximate counting: count-min data structure. Problem definition

Approximate counting: count-min data structure. Problem definition Approximate counting: count-min data structure G. Cormode and S. Muthukrishhan: An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55 (2005) 58-75. Problem

More information

On Condensed Representations of Constrained Frequent Patterns

On Condensed Representations of Constrained Frequent Patterns Under consideration for publication in Knowledge and Information Systems On Condensed Representations of Constrained Frequent Patterns Francesco Bonchi 1 and Claudio Lucchese 2 1 KDD Laboratory, ISTI Area

More information

CPT+: A Compact Model for Accurate Sequence Prediction

CPT+: A Compact Model for Accurate Sequence Prediction CPT+: A Compact Model for Accurate Sequence Prediction Ted Gueniche 1, Philippe Fournier-Viger 1, Rajeev Raman 2, Vincent S. Tseng 3 1 University of Moncton, Canada 2 University of Leicester, UK 3 National

More information

Discovering Non-Redundant Association Rules using MinMax Approximation Rules

Discovering Non-Redundant Association Rules using MinMax Approximation Rules Discovering Non-Redundant Association Rules using MinMax Approximation Rules R. Vijaya Prakash Department Of Informatics Kakatiya University, Warangal, India vijprak@hotmail.com Dr.A. Govardhan Department.

More information

Interesting Patterns. Jilles Vreeken. 15 May 2015

Interesting Patterns. Jilles Vreeken. 15 May 2015 Interesting Patterns Jilles Vreeken 15 May 2015 Questions of the Day What is interestingness? what is a pattern? and how can we mine interesting patterns? What is a pattern? Data Pattern y = x - 1 What

More information

ST3232: Design and Analysis of Experiments

ST3232: Design and Analysis of Experiments Department of Statistics & Applied Probability 2:00-4:00 pm, Monday, April 8, 2013 Lecture 21: Fractional 2 p factorial designs The general principles A full 2 p factorial experiment might not be efficient

More information

Data Mining Concepts & Techniques

Data Mining Concepts & Techniques Data Mining Concepts & Techniques Lecture No. 05 Sequential Pattern Mining Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro

More information

Data Mining and Analysis: Fundamental Concepts and Algorithms

Data Mining and Analysis: Fundamental Concepts and Algorithms Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA

More information

DATA MINING - 1DL105, 1DL111

DATA MINING - 1DL105, 1DL111 1 DATA MINING - 1DL105, 1DL111 Fall 2007 An introductory class in data mining http://user.it.uu.se/~udbl/dut-ht2007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database

More information

CS738 Class Notes. Steve Revilak

CS738 Class Notes. Steve Revilak CS738 Class Notes Steve Revilak January 2008 May 2008 Copyright c 2008 Steve Revilak. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation

More information

Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules. M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D.

Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules. M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D. Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D. Sánchez 18th September 2013 Detecting Anom and Exc Behaviour on

More information

Theory of Dependence Values

Theory of Dependence Values Theory of Dependence Values ROSA MEO Università degli Studi di Torino A new model to evaluate dependencies in data mining problems is presented and discussed. The well-known concept of the association

More information