Association Analysis 1


Association Rules 1 3

Learning Outcomes Motivation Association Rule Mining Definition Terminology Apriori Algorithm Reducing Output: Closed and Maximal Itemsets 4

Overview One of the most popular problems in computer science! [Agrawal, Imielinski, Swami, SIGMOD 1993] 13th most cited across all computer science [Agrawal, Srikant, VLDB 1994] 15th most cited across all computer science 5

Overview Pattern Mining: unsupervised learning. Useful for exploring large datasets (what is this data like?) and for building global models. Less suitable for well-studied and understood problem domains 6

Are humans inherently good at pattern mining? Is there a pattern? 7

Pattern or Illusion? 8

Pattern Mining Humans have always been doing pattern mining Observing & predicting nature The heavens (climate, navigation) Humans generally good at pattern recognition Illusion vs. pattern? We see what we want to see! (bias) Restricted to the natural dimensionality: 3D 9

Pattern vs. Chance Dog is not the pattern; the black patches are! But is that an interesting pattern? 10

What is a pattern? Repetitiveness Basically depends on counting Interestingness Avoid trivial patterns Chance occurrences Use statistical tests to weed these out 11

Motivation Frequent Item Set Mining is a method for market basket analysis. It aims at finding regularities in the shopping behavior of customers of supermarkets, mail-order companies, on-line shops etc. More specifically: find sets of products that are frequently bought together. Possible applications of found frequent item sets: improve arrangement of products on shelves, on a catalog's pages etc.; support cross-selling (suggestion of other products) and product bundling; fraud detection, technical dependence analysis etc. Often the found patterns are expressed as association rules, for example: if a customer buys bread and wine, then she/he will probably also buy cheese. 12

Beer and Nappies - A Data Mining Urban Legend There is a story that a large supermarket chain, usually Wal-Mart, did an analysis of customers' buying habits and found a statistically significant correlation between purchases of beer and purchases of nappies (diapers in the US). It was theorized that the reason for this was that fathers were stopping off at Wal-Mart to buy nappies for their babies, and since they could no longer go down to the pub as often, would buy beer as well. As a result of this finding, the supermarket chain is alleged to have placed the nappies next to the beer, resulting in increased sales of both. Here's what really happened: the analysis did discover that between 5:00 and 7:00 p.m. consumers bought beer and diapers, but Osco managers did NOT exploit the beer-and-diapers relationship by moving the products closer together on the shelves to increase sales of both. 13

Association Rule Mining Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction. Market-Basket Transactions: TID Items: 1 Bread, Milk; 2 Bread, Diaper, Beer, Eggs; 3 Milk, Diaper, Beer, Coke; 4 Bread, Milk, Diaper, Coke; 5 Bread, Milk, Diaper, Coke. Examples of Association Rules: {Diaper} -> {Beer}, {Milk, Bread} -> {Eggs, Coke}, {Beer, Bread} -> {Milk}. Implication means co-occurrence, not causality! 14

Definition: Frequent Itemset. Itemset: a collection of one or more items, e.g. {Milk, Bread, Diaper}. k-itemset: an itemset that contains k items. Support count (σ): frequency of occurrence of an itemset, e.g. σ({Milk, Bread, Diaper}) = 2. Support (s): fraction of transactions that contain an itemset, e.g. s({Milk, Bread, Diaper}) = 2/5. Frequent Itemset: an itemset whose support is greater than or equal to a minsup threshold. TID Items: 1 Bread, Milk; 2 Bread, Diaper, Beer, Eggs; 3 Milk, Diaper, Beer, Coke; 4 Bread, Milk, Diaper, Beer; 5 Bread, Milk, Diaper, Coke 15

Definition: Association Rule. Association Rule: an implication expression of the form X -> Y, where X and Y are itemsets. Example: {Milk, Diaper} -> {Beer}. Rule Evaluation Metrics: Support (s): fraction of transactions that contain both X and Y. Confidence (c): measures how often items in Y appear in transactions that contain X. TID Items: 1 Bread, Milk; 2 Bread, Diaper, Beer, Eggs; 3 Milk, Diaper, Beer, Coke; 4 Bread, Milk, Diaper, Beer; 5 Bread, Milk, Diaper, Coke. Example, for {Milk, Diaper} -> {Beer}: s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4; c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67 16
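As a quick illustration of these two metrics, the following minimal Python sketch (my own illustration, not part of the original slides) recomputes the support and confidence of {Milk, Diaper} -> {Beer} on the five transactions above.

```python
# Minimal sketch: support and confidence of {Milk, Diaper} -> {Beer}
# on the five market-basket transactions from the slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)               # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3 ~ 0.67
print(f"support = {s:.2f}, confidence = {c:.2f}")
```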

Association Rule Mining Task Given a set of transactions T, the goal of association rule mining is to find all rules having support >= minsup threshold and confidence >= minconf threshold. Brute-force approach: list all possible association rules, compute the support and confidence for each rule, prune rules that fail the minsup and minconf thresholds. Computationally prohibitive! 17

Mining Association Rules TID Items: 1 Bread, Milk; 2 Bread, Diaper, Beer, Eggs; 3 Milk, Diaper, Beer, Coke; 4 Bread, Milk, Diaper, Beer; 5 Bread, Milk, Diaper, Coke. Examples of Rules: {Milk, Diaper} -> {Beer} (s=0.4, c=0.67); {Milk, Beer} -> {Diaper} (s=0.4, c=1.0); {Diaper, Beer} -> {Milk} (s=0.4, c=0.67); {Beer} -> {Milk, Diaper} (s=0.4, c=0.67); {Diaper} -> {Milk, Beer} (s=0.4, c=0.5); {Milk} -> {Diaper, Beer} (s=0.4, c=0.5). Observations: all the above rules are binary partitions of the same itemset {Milk, Diaper, Beer}. Rules originating from the same itemset have identical support but can have different confidence. Thus, we may decouple the support and confidence requirements 18

Mining Association Rules Two-step approach: 1. Frequent Itemset Generation: generate all itemsets whose support >= minsup. 2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset. Frequent itemset generation is still computationally expensive 19

Frequent Itemset Generation [Itemset lattice: null; A, B, C, D, E; AB, AC, AD, AE, BC, BD, BE, CD, CE, DE; ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE; ABCD, ABCE, ABDE, ACDE, BCDE; ABCDE.] Given d items, there are 2^d possible candidate itemsets 20

Frequent Itemset Generation Each itemset in the lattice is a candidate frequent itemset. Count the support of each candidate by scanning the database of N transactions. [Table, TID / Items: 001 Bread, Milk, Apples, Juice, Chips, Soda; 002 Soda, Chips; 003 Bread, Milk, Butter; 004 Apples, Soda, Chips, Milk, Bread; 005 Cheese, Chips, Soda; 006 Bread, Milk, Chips; 007 Apples, Chips, Soda; 008 Crackers, Apples, Butter. Candidate itemsets include Bread, Milk, Apples, Chips, ...] Match each transaction (of width up to W) against every one of the M candidate itemsets. Complexity: O(NWM), expensive since M = 2^d - 1 21

Computational Complexity Given d unique items: total number of itemsets = 2^d; total number of possible association rules: R = sum_{k=1}^{d-1} [ C(d,k) * sum_{j=1}^{d-k} C(d-k, j) ] = 3^d - 2^(d+1) + 1. If d = 6, R = 602 rules. If d = 10, R = 57002 rules. If d = 15, R ~ 1.4 * 10^7 rules 22
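The closed form can be checked numerically; the short sketch below (my own illustration, not part of the slides) evaluates both the double sum and the closed form 3^d - 2^(d+1) + 1 for a few values of d.

```python
# Sanity check of the rule-count formula
# R = sum_{k=1}^{d-1} C(d,k) * sum_{j=1}^{d-k} C(d-k,j) = 3^d - 2^(d+1) + 1.
from math import comb

def num_rules(d):
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

for d in (6, 10, 15):
    print(d, num_rules(d), 3**d - 2**(d + 1) + 1)
# 6 -> 602, 10 -> 57002, 15 -> 14283372 (~1.4 * 10^7)
```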

Needle in a haystack!! 23

Frequent Itemset Generation Strategies Reduce the number of candidates (M): complete search has M = 2^d; use pruning techniques to reduce M. Reduce the number of transactions (N): reduce the size of N as the size of the itemsets increases; used by DHP and vertical-based mining algorithms. Reduce the number of comparisons (NM): use efficient data structures to store the candidates or transactions, so there is no need to match every candidate against every transaction 24

The Apriori Algorithm [Agrawal and Srikant 1994] 25

Reducing Number of Candidates Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent. The Apriori principle holds due to the following property of the support measure: for all X, Y: (X subset of Y) implies s(X) >= s(Y). The support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support 26

Illustrating Apriori Principle null A B C D E AB AC AD AE BC BD BE CD CE DE Found to be infrequent ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE Pruned supersets ABCDE 27

Illustrating Apriori Principle (minsup = 0.5, i.e. a support count of at least 2). Database D: TID 100: A C D; TID 200: B C E; TID 300: A B C E; TID 400: B E. Scan D and count: C1 = {A}:2, {B}:3, {C}:3, {D}:1, {E}:3. Prune: L1 = {A}:2, {B}:3, {C}:3, {E}:3. Join, scan D and count: C2 = {A B}:1, {A C}:2, {A E}:1, {B C}:2, {B E}:3, {C E}:2. Prune: L2 = {A C}:2, {B C}:2, {B E}:3, {C E}:2. Join, scan D and count: C3 = {B C E}:2. Prune: L3 = {B C E}:2. 28

Apriori Algorithm Method: Let k=1 Generate frequent itemsets of length 1 Repeat until no new frequent itemsets are identified Generate length (k+1) candidate itemsets from length k frequent itemsets Prune candidate itemsets containing subsets of length k that are infrequent Count the support of each candidate by scanning the DB Eliminate candidates that are infrequent, leaving only those that are frequent 29
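As a concrete (and deliberately naive) sketch of this loop, the Python below is my own illustration, not the lecture's reference code; it takes an absolute support count `minsup_count` rather than a fraction.

```python
# A compact, illustrative Apriori sketch (not an optimized implementation).
# Itemsets are frozensets; `minsup_count` is an absolute support-count threshold.
from itertools import combinations

def apriori(transactions, minsup_count):
    transactions = [frozenset(t) for t in transactions]
    # k = 1: count single items and keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= minsup_count}
    frequent = {s: c for s, c in counts.items() if c >= minsup_count}
    k = 1
    while Lk:
        # candidate generation: join length-k frequent itemsets, keep size k+1
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # prune candidates that have an infrequent length-k subset
        candidates = {c for c in candidates
                      if all(frozenset(sub) in Lk for sub in combinations(c, k))}
        # support counting by scanning the database
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {c for c, n in counts.items() if n >= minsup_count}
        frequent.update({c: n for c, n in counts.items() if n >= minsup_count})
        k += 1
    return frequent

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, count in sorted(apriori(db, 2).items(),
                             key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), count)
```

On the four-transaction database from the walkthrough above, with a minimum count of 2, this reproduces L1 = {A}, {B}, {C}, {E}, L2 = {A,C}, {B,C}, {B,E}, {C,E} and L3 = {B,C,E}.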

Factors Affecting Complexity Choice of minimum support threshold Lowering support threshold results in more frequent itemsets This may increase number of candidates and max length of frequent itemsets Dimensionality (number of items) of the data set More space is needed to store support count of each item If number of frequent items also increases, both computation and I/O costs may also increase Size of database Since Apriori makes multiple passes, run time of algorithm may increase with number of transactions Average transaction width Transaction width increases with denser data sets This may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width) 30

Reducing the Output: Closed and Maximal Item Sets 32

Maximal Frequent Itemset An itemset is maximal frequent if none of its immediate supersets is frequent. [Itemset lattice from null to ABCDE, with the maximal itemsets, the infrequent itemsets and the border between frequent and infrequent itemsets marked.] 33

Closed Itemset An itemset is closed if none of its immediate supersets has the same support as the itemset TID Items 1 {A,B} 2 {B,C,D} 3 {A,B,C,D} 4 {A,B,D} 5 {A,B,C,D} Itemset Support {A} 4 {B} 5 {C} 3 {D} 4 {A,B} 4 {A,C} 2 {A,D} 3 {B,C} 3 {B,D} 4 {C,D} 3 Itemset Support {A,B,C} 2 {A,B,D} 3 {A,C,D} 2 {B,C,D} 3 {A,B,C,D} 2 34
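For concreteness, here is a small brute-force sketch (mine, not from the slides) that checks which itemsets of the five transactions above are closed and which are maximal frequent; the minsup count of 2 used for the maximal check is an assumption for illustration.

```python
# Illustrative check of "closed" and "maximal frequent" on the five transactions
# from this slide. The minsup count of 2 is an assumed value for the example.
from itertools import combinations

db = [{"A", "B"}, {"B", "C", "D"}, {"A", "B", "C", "D"},
      {"A", "B", "D"}, {"A", "B", "C", "D"}]
items = sorted(set().union(*db))

def sup(itemset):
    return sum(1 for t in db if set(itemset) <= t)

all_itemsets = [frozenset(c) for k in range(1, len(items) + 1)
                for c in combinations(items, k)]

def is_closed(x):
    # closed: no immediate superset has the same support
    return all(sup(x | {i}) < sup(x) for i in items if i not in x)

minsup = 2
def is_maximal(x):
    # maximal frequent: frequent, and no immediate superset is frequent
    return sup(x) >= minsup and all(sup(x | {i}) < minsup for i in items if i not in x)

print("closed: ", [set(x) for x in all_itemsets if is_closed(x)])
print("maximal:", [set(x) for x in all_itemsets if is_maximal(x)])
```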

Maximal vs Closed Itemsets. Transaction database: TID Items: 1 ABC; 2 ABCD; 3 BCE; 4 ACDE; 5 DE. [Itemset lattice from null to ABCDE, with each itemset annotated with the ids of the transactions that contain it, e.g. A: 1,2,4; B: 1,2,3; C: 1,2,3,4; D: 2,4,5; E: 3,4,5. Itemsets not supported by any transaction are marked.] 35

Maximal vs Closed Frequent Itemsets (minimum support = 2). [Same transaction-id-annotated lattice, with the closed-but-not-maximal and the closed-and-maximal frequent itemsets highlighted.] # Closed = 9, # Maximal = 4. 36

Maximal vs Closed Itemsets. [Venn diagram: Maximal Frequent Itemsets are contained in Closed Frequent Itemsets, which are contained in Frequent Itemsets.] 37

Summary: Association Rule Mining; Apriori Algorithm; Closed and Maximal Itemsets. Next Class: alternative methods - FP-Tree, ECLAT 38

Association Rules 2 39

Learning Outcomes: Alternative methods to Apriori (Last Lecture) - FP-Tree, ECLAT; Rule Generation 40

Alternative Methods for Frequent Itemset Generation Traversal of Itemset Lattice: general-to-specific vs specific-to-general. [Diagrams of the itemset lattice from the empty set down to {a1, a2, ..., an}, showing the frequent itemset border for (a) general-to-specific, (b) specific-to-general and (c) bidirectional traversal.] 41

Alternative Methods for Frequent Itemset Generation Traversal of Itemset Lattice - Equivalent Classes null null A B C D A B C D AB AC AD BC BD CD AB AC BC AD BD CD ABC ABD ACD BCD ABC ABD ACD BCD ABCD ABCD (a) Prefix tree (b) Suffix tree 42

Alternative Methods for Frequent Itemset Generation Traversal of Itemset Lattice Breadth-first vs Depth-first (a) Breadth first (b) Depth first 43

Alternative Methods for Frequent Itemset Generation Representation of Database: horizontal vs vertical data layout. Horizontal Data Layout (TID: Items): 1: A,B,E; 2: B,C,D; 3: C,E; 4: A,C,D; 5: A,B,C,D; 6: A,E; 7: A,B; 8: A,B,C; 9: A,C,D; 10: B. Vertical Data Layout (item: TID-list): A: 1,4,5,6,7,8,9; B: 1,2,5,7,8,10; C: 2,3,4,8,9; D: 2,4,5,9; E: 1,3,6. 44

Rule Generation 45

Rule Generation Given a frequent itemset L, find all non-empty proper subsets f of L such that f -> L - f satisfies the minimum confidence requirement. If {A,B,C,D} is a frequent itemset, the candidate rules are: ABC -> D, ABD -> C, ACD -> B, BCD -> A, A -> BCD, B -> ACD, C -> ABD, D -> ABC, AB -> CD, AC -> BD, AD -> BC, BC -> AD, BD -> AC, CD -> AB. If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L -> {} and {} -> L) 46
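A tiny sketch of this enumeration (my own illustration): every non-empty proper subset of the frequent itemset becomes a rule antecedent.

```python
# Sketch: enumerating the 2^k - 2 candidate rules of a frequent itemset
# (every non-empty proper subset f becomes the antecedent of f -> L \ f).
from itertools import combinations

L = {"A", "B", "C", "D"}
rules = [(set(f), L - set(f))
         for r in range(1, len(L))
         for f in combinations(sorted(L), r)]
print(len(rules))            # 2**4 - 2 = 14
for lhs, rhs in rules:
    print(lhs, "->", rhs)
```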

Rule Generation How to efficiently generate rules from frequent itemsets? In general, confidence does not have an anti-monotone property: c(ABC -> D) can be larger or smaller than c(AB -> D). But the confidence of rules generated from the same itemset does have an anti-monotone property, e.g. for L = {A,B,C,D}: c(ABC -> D) >= c(AB -> CD) >= c(A -> BCD). Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule 47

Lattice of rules Low Confidence Rule Rule Generation for Apriori Algorithm ABCD=>{ } BCD=>A ACD=>B ABD=>C ABC=>D CD=>AB BD=>AC BC=>AD AD=>BC AC=>BD AB=>CD Pruned Rules D=>ABC C=>ABD B=>ACD A=>BCD 48

Rule Generation for Apriori Algorithm A candidate rule is generated by merging two rules that share the same prefix in the rule consequent: join(CD=>AB, BD=>AC) would produce the candidate rule D=>ABC. Prune rule D=>ABC if its subset AD=>BC does not have high confidence 49

Apriori Apriori: uses a generate-and-test approach generates candidate itemsets and tests if they are frequent Generation of candidate itemsets is expensive (in both space and time) Support counting is expensive Subset checking (computationally expensive) Multiple Database scans (I/O) 50

The FP-Growth Algorithm Frequent Pattern Growth Algorithm [Han, Pei, and Yin 2000] 51

FP-Growth Algorithm FP-Growth allows frequent itemset discovery without candidate itemset generation. Two-step approach: Step 1: build a compact data structure called the FP-tree, using 2 passes over the data set. Step 2: extract frequent itemsets directly from the FP-tree by traversing it 52
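To make Step 1 concrete, here is a minimal FP-tree construction sketch (my own illustration, not the authors' code); the header-table links and the recursive mining of Step 2 are omitted, and items are inserted in the fixed A-E order used on these slides.

```python
# Minimal FP-tree construction sketch (illustration only).
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, minsup_count):
    # Pass 1: count item frequencies and keep only the frequent items
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= minsup_count}

    root = FPNode(None, None)
    # Pass 2: insert each transaction's frequent items in the fixed A..E order
    # used on these slides (ordering by decreasing support is the more common
    # heuristic, but any global order works).
    for t in transactions:
        items = sorted(i for i in t if i in freq)
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, freq

def dump(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        dump(child, depth + 1)

db = [{"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"}, {"A","B","C"},
      {"A","B","C","D"}, {"B","C"}, {"A","B","C"}, {"A","B","D"}, {"B","C","E"}]
tree, _ = build_fp_tree(db, 2)
dump(tree)
```

Run on the ten-transaction database used in the slides below, this produces the root children A:7 and B:3 shown in the figures.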

Finding Frequent Itemsets Input: the FP-tree. Output: all frequent itemsets and their support. Method: divide and conquer. Consider all itemsets that end in E, D, C, B, A. For each possible ending item, consider the itemsets whose remaining items are among the items preceding it in the ordering. E.g., for E, consider all itemsets with last item D, C, B, A; this way we get all the itemsets ending in DE, CE, BE, AE. Proceed recursively this way. Do this for all items.

Frequent itemsets. [All itemsets over {A, B, C, D, E}, organized by their last item: E, D, C, B, A; then DE, CE, BE, AE, CD, BD, AD, BC, AC, AB; then the 3-itemsets, 4-itemsets and ABCDE. For each itemset we ask: is it frequent?]

We can generate all itemsets this way. We expect the FP-tree to contain far fewer.

Using the FP-tree to find frequent itemsets. Transaction Database: TID Items: 1 {A,B}; 2 {B,C,D}; 3 {A,C,D,E}; 4 {A,D,E}; 5 {A,B,C}; 6 {A,B,C,D}; 7 {B,C}; 8 {A,B,C}; 9 {A,B,D}; 10 {B,C,E}. [FP-tree built from this database, with a header table (items A-E, each pointing into its chain of nodes); the root null has children A:7 and B:3, with further nodes such as B:5, C:3, C:1 and E:1 leaves.] Bottom-up traversal of the tree: first itemsets ending in E, then D, etc., each time a suffix-based class

Finding Frequent Itemsets. Subproblem: find frequent itemsets ending in E. [The FP-tree with header table; the E nodes and their prefix paths are the relevant part.] We will then see how to compute the support for the possible itemsets.

Finding Frequent Itemsets. The same is then done for itemsets ending in D, then C, then B, and finally A. [Same FP-tree, with the corresponding nodes highlighted in each case.]

Algorithm For each suffix X Phase 1 Construct the prefix tree for X as shown before, and compute the support using the header table and the pointers Phase 2 If X is frequent, construct the conditional FP-tree for X in the following steps 1. Recompute support 2. Prune infrequent items 3. Prune leaves and recurse

Example Phase 1 construct prefix tree null Find all prefix paths that contain E A:7 B:3 B:5 C:1 C:3 Header table Item A B C D E Pointer C:3 E:1 E:1 E:1 Suffix Paths for Ε: {A,C,D,E}, {A,D,Ε}, {B,C,E}

Example Phase 1 construct prefix tree Find all prefix paths that contain E null A:7 B:3 C:1 C:3 E:1 E:1 E:1 Prefix Paths for Ε: {A,C,D,E}, {A,D,Ε}, {B,C,E}

Example Compute Support for E (minsup = 2) How? Follow pointers while summing up counts: 1+1+1 = 3 > 2 E is frequent null A:7 B:3 C:1 C:3 E:1 E:1 E:1 {E} is frequent so we can now consider suffixes DE, CE, BE, AE

Example E is frequent so we proceed with Phase 2 Phase 2 Convert the prefix tree of E into a conditional FP-tree Two changes (1) Recompute support (2) Prune infrequent null A:7 B:3 C:3 C:1 E:1 E:1 E:1

Example Recompute Support null A:7 B:3 The support counts for some of the nodes include transactions that do not end in E For example in null->b->c->e we count {B, C} C:1 C:3 The support of any node is equal to the sum of the support of leaves with label E in its subtree E:1 E:1 E:1

Example null A:7 B:3 C:1 C:3 E:1 E:1 E:1

Example null A:7 B:3 C:1 C:1 E:1 E:1 E:1

Example null A:7 B:1 C:1 C:1 E:1 E:1 E:1


Example null A:2 B:1 C:1 C:1 E:1 E:1 E:1


Example null Truncate A:2 B:1 Delete the nodes of Ε C:1 C:1 E:1 E:1 E:1


Example null Truncate A:2 B:1 Delete the nodes of Ε C:1 C:1

Example null Prune infrequent A:2 B:1 In the conditional FP-tree some nodes may have support less than minsup C:1 C:1 e.g., B needs to be pruned. This means that B appears with E less than minsup times

Example null A:2 B:1 C:1 C:1

Example null A:2 C:1 C:1

Example null A:2 C:1 C:1 The conditional FP-tree for E Repeat the algorithm for {D, E}, {C, E}, {A, E}

Example null A:2 C:1 C:1 Phase 1 Find all prefix paths that contain D (DE) in the conditional FP-tree

Example null A:2 C:1 Phase 1 Find all prefix paths that contain D (DE) in the conditional FP-tree

Example null A:2 C:1 Compute the support of {D,E} by following the pointers in the tree: 1+1 = 2 >= minsup, so {D,E} is frequent

Example null A:2 C:1 Phase 2 Construct the conditional FP-tree 1. Recompute Support 2. Prune nodes

Example null A:2 Recompute support C:1

Example null A:2 Prune nodes C:1


Example null A:2 Prune nodes C:1 Small support

Example null A:2 Final conditional FP-tree for {D,E}. The support of A is >= minsup, so {A,D,E} is frequent. Since the tree has a single node we return to the next subproblem.

Example null A:2 C:1 C:1 The conditional FP-tree for E We repeat the algorithm for {D,E}, {C,E}, {A,E}

Example null A:2 C:1 C:1 Phase 1 Find all prefix paths that contain C (CE) in the conditional FP-tree

Example null A:2 C:1 C:1 Phase 1 Find all prefix paths that contain C (CE) in the conditional FP-tree

Example null A:2 C:1 C:1 Compute the support of {C,E} by following the pointers in the tree: 1+1 = 2 >= minsup, so {C,E} is frequent

Example null A:2 C:1 C:1 Phase 2 Construct the conditional FP-tree 1. Recompute Support 2. Prune nodes

Example null A:1 C:1 Recompute support C:1

Example null A:1 C:1 Prune nodes C:1

Example null A:1 Prune nodes


Example null Prune nodes Return to the previous subproblem

Example null A:2 C:1 C:1 The conditional FP-tree for E We repeat the algorithm for {D,E}, {C,E}, {A,E}

Example null A:2 C:1 C:1 Phase 1 Find all prefix paths that contain A (AE) in the conditional FP-tree

Example null A:2 Phase 1 Find all prefix paths that contain A (AE) in the conditional FP-tree

Example null A:2 Compute the support of {A,E} by following the pointers in the tree: 2 >= minsup, so {A,E} is frequent. There is no conditional FP-tree for {A,E}

Example So for E we have the following frequent itemsets {E}, {D,E}, {C,E}, {A,E} We proceed with D

Example Ending in D A:7 null B:3 B:5 C:1 C:3 Header table Item A B C D E Pointer C:3 E:1 E:1 E:1

Example null Phase 1 construct prefix tree Find all prefix paths that contain D A:7 B:3 Support 5 > minsup, D is frequent Phase 2 Convert prefix tree into conditional FP-tree C:3 B:5 C:1 C:3

Example null A:7 B:3 B:5 C:1 C:3 C:1 Recompute support

Example null A:7 B:3 B:2 C:1 C:3 C:1 Recompute support

Example null A:3 B:3 B:2 C:1 C:3 C:1 Recompute support

Example null A:3 B:3 B:2 C:1 C:1 C:1 Recompute support

Example null A:3 B:1 B:2 C:1 C:1 C:1 Recompute support

Example null A:3 B:1 B:2 C:1 C:1 C:1 Prune nodes

Example null A:3 B:1 B:2 C:1 C:1 C:1 Prune nodes

Example null A:3 B:1 B:2 C:1 C:1 C:1 Construct conditional FP-trees for {C,D}, {B,D}, {A,D} And so on.

Observations At each recursive step we solve a subproblem: construct the prefix tree, compute the new support, prune nodes. Subproblems are disjoint, so we never consider the same itemset twice. Support computation is efficient: it happens together with the computation of the frequent itemsets.

Observations The efficiency of the algorithm depends on the compaction factor of the dataset. If the tree is bushy then the algorithm does not work well, because the number of subproblems that need to be solved increases substantially.

FP-Tree size The FP-tree usually has a smaller size than the uncompressed data, because typically many transactions share items (and hence prefixes). Best case scenario: all transactions contain the same set of items - 1 path in the FP-tree. Worst case scenario: every transaction has a unique set of items (no items in common) - the size of the FP-tree is at least as large as the original data, and the storage requirements are higher because we need to store the pointers between the nodes and the counters. The size of the FP-tree depends on how the items are ordered. Ordering by decreasing support is typically used, but it does not always lead to the smallest tree (it's a heuristic). 120

Discussion Advantages of FP-Growth: only 2 passes over the dataset; compresses the dataset; no candidate generation; much faster than Apriori. Disadvantages of FP-Growth: the FP-tree may not fit in memory!! The FP-tree is expensive to build. Trade-off: it takes time to build, but once it is built, frequent itemsets are read off easily. Time is wasted (especially if the support threshold is high), as the only pruning that can be done is on single items. Support can only be calculated once the entire dataset is added to the FP-tree. 121

Should the FP-Growth be used in favor of Apriori? Does FP-Growth have a better complexity than Apriori? Does FP-Growth always have a better running time than Apriori? More relevant questions What kind of data is it? What kind of results do I want? How will I analyze the resulting rules? 122

FP-Growth vs Apriori [Plot: run time (sec.) versus support threshold (%) for D1 FP-growth runtime and D1 Apriori runtime, on data set T25I20D10K.] 123

The Eclat Algorithm [Zaki, Parthasarathy, Ogihara, and Li 1997] 124

ECLAT For each item, store a list of transaction ids (tids). Horizontal Data Layout (TID: Items): 1: A,B,E; 2: B,C,D; 3: C,E; 4: A,C,D; 5: A,B,C,D; 6: A,E; 7: A,B; 8: A,B,C; 9: A,C,D; 10: B. Vertical Data Layout (item: TID-list): A: 1,4,5,6,7,8,9; B: 1,2,5,7,8,10; C: 2,3,4,8,9; D: 2,4,5,9; E: 1,3,6. 128

ECLAT Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets. For example, A: 1,4,5,6,7,8,9 and B: 1,2,5,7,8,10 give AB: 1,5,7,8. Three traversal approaches: top-down, bottom-up and hybrid. Advantage: very fast support counting. Disadvantage: intermediate tid-lists may become too large for memory 129
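A few lines make the tid-list idea concrete (illustrative sketch using the A and B lists above):

```python
# Sketch of the Eclat idea: supports come from intersecting tid-lists.
tidlists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
}
AB = tidlists["A"] & tidlists["B"]   # {1, 5, 7, 8}
print(sorted(AB), "support =", len(AB))
```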

Partitioning The set of transactions may be divided into a number of disjoint subsets. Then each partition is searched for frequent itemsets. These frequent itemsets are called local frequent itemsets. How can information about local itemsets be used in finding frequent itemsets of the global set of transactions? Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB. 130

Partitioning Phase 1 Divide n transactions into m partitions Find the frequent itemsets in each partition Combine all local frequent itemsets to form candidate itemsets Phase 2 Find global frequent itemsets December 2008 GKGupta 131

Sampling A random sample (usually large enough to fit in the main memory) may be obtained from the overall set of transactions and the sample is searched for frequent itemsets. These frequent itemsets are called sample frequent itemsets. How can information about sample itemsets be used in finding frequent itemsets of the global set of transactions? 132

Sampling Not guaranteed to be accurate, but we sacrifice accuracy for efficiency. A lower support threshold may be used for the sample to ensure we do not miss any frequent itemsets. The actual frequencies of the sample frequent itemsets are then obtained. More than one sample could be used to improve accuracy. 133

Summary We looked at 2 alternative methods, FP-Tree and ECLAT, and the rule generation phase 134

Association Rule 3 135

Learning Outcomes Pattern Evaluation; Advanced Pattern Mining Techniques 136

Pattern Evaluation 137

The Knowledge Discovery Process Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad et al. 1996) Focus on the quality of discovered patterns independent of the data mining algorithm This definition is often quoted, but not very seriously taken into account

Interestingness of Discovered Patterns [Diagram relating the criteria valid (accurate), comprehensible, novel/surprising and useful to the difficulty of measurement and the amount of research devoted to them.] 139

Selecting the Most Interesting Rules in a Post-Processing Step Goal: select a subset of interesting rules out of all discovered rules, in order to show the user just the most interesting rules. User-driven, subjective approach: uses the domain knowledge, beliefs or preferences of the user. Data-driven, objective approach: based on statistical properties of the discovered patterns; more generic, independent of the application domain and user

Data-driven vs user-driven rule interestingness measures [Diagram: in the data-driven approach, all rules go through a data-driven measure and the selected rules are shown to the user; in the user-driven approach, the user's beliefs and domain knowledge feed a user-driven measure that selects the rules. Note: the user-driven measure reflects real user interest.]

Data-driven Approach Objective Measures 142

Data-driven approaches Two broad groups of approaches. The first is based on the statistical strength of patterns: entirely data-driven, with no user-driven aspect. The second is based on finding specific kinds of surprising patterns: it relies on the assumption that a certain kind of pattern is surprising to any user in general, but is still independent from the application domain and from the beliefs / domain knowledge of the target user - so it is mainly data-driven

Pattern Evaluation Association rule algorithms tend to produce too many rules; many of them are uninteresting or redundant. Redundant if {A,B,C} -> {D} and {A,B} -> {D} have the same support & confidence. Interestingness measures can be used to prune/rank the derived patterns. In the original formulation of association rules, support & confidence are the only measures used 144

Application of Interestingness Measure Interestingness Measures 145

Computing Interestingness Measure Given a rule X -> Y, the information needed to compute rule interestingness can be obtained from a contingency table. Contingency table for X -> Y: rows X and not-X, columns Y and not-Y; cells f11, f10 (row total f1+), f01, f00 (row total f0+); column totals f+1, f+0; grand total |T|. f11: support of X and Y; f10: support of X and not-Y; f01: support of not-X and Y; f00: support of not-X and not-Y. Used to define various measures: support, confidence, lift, Gini, J-measure, etc. 146

Drawback of Confidence [Contingency table: Tea and Coffee = 15, Tea and not-Coffee = 5 (Tea total 20); not-Tea and Coffee = 75, not-Tea and not-Coffee = 5 (not-Tea total 80); Coffee total 90, not-Coffee total 10, N = 100.] Association Rule: Tea -> Coffee. Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9. Although confidence is high, the rule is misleading: P(Coffee | not-Tea) = 0.9375 147

Statistical Independence Population of 1000 students: 600 students know how to swim (S), 700 students know how to bike (B), 420 students know how to swim and bike (S,B). P(S,B) = 420/1000 = 0.42; P(S) * P(B) = 0.6 * 0.7 = 0.42. P(S,B) = P(S) P(B) => statistical independence. P(S,B) > P(S) P(B) => positively correlated. P(S,B) < P(S) P(B) => negatively correlated 148

Statistical-based Measures Measures that take into account statistical dependence: Lift = P(Y|X) / P(Y); Interest = P(X,Y) / (P(X) P(Y)); PS = P(X,Y) - P(X) P(Y); φ-coefficient = (P(X,Y) - P(X) P(Y)) / sqrt( P(X) [1 - P(X)] P(Y) [1 - P(Y)] ) 149
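These formulas are easy to evaluate from a contingency table; the sketch below (not from the slides) does so for the Tea -> Coffee table used earlier (f11=15, f10=5, f01=75, f00=5), reproducing the lift of 0.8333.

```python
# Sketch: computing the measures above from the Tea/Coffee contingency table.
from math import sqrt

f11, f10, f01, f00 = 15, 5, 75, 5
N = f11 + f10 + f01 + f00
pX, pY, pXY = (f11 + f10) / N, (f11 + f01) / N, f11 / N

lift = (pXY / pX) / pY                 # = interest = P(X,Y) / (P(X) P(Y))
ps = pXY - pX * pY                     # Piatetsky-Shapiro
phi = (pXY - pX * pY) / sqrt(pX * (1 - pX) * pY * (1 - pY))
print(round(lift, 4), round(ps, 4), round(phi, 4))   # 0.8333, -0.03, -0.25
```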

Example: Lift/Interest [Same Tea/Coffee contingency table as before.] Association Rule: Tea -> Coffee. Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9. Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated) 150

Drawback of Lift & Interest [Table 1: X and Y = 10, X and not-Y = 0 (total 10); not-X and Y = 0, not-X and not-Y = 90 (total 90); margins 10, 90, N = 100.] Lift = 0.1 / (0.1 * 0.1) = 10. [Table 2: X and Y = 90, X and not-Y = 0 (total 90); not-X and Y = 0, not-X and not-Y = 10 (total 10); margins 90, 10, N = 100.] Lift = 0.9 / (0.9 * 0.9) = 1.11. Statistical independence: if P(X,Y) = P(X) P(Y) then Lift = 1 151

There are lots of measures proposed in the literature. Some measures are good for certain applications, but not for others. What criteria should we use to determine whether a measure is good or bad? What about Apriori-style support-based pruning? How does it affect these measures? 152

Properties of A Good Measure Piatetsky-Shapiro: 3 properties a good measure M must satisfy: M(A,B) = 0 if A and B are statistically independent; M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged; M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged 153

Comparing Different Measures 10 examples of contingency tables, which are then ranked using the various measures. Example: f11, f10, f01, f00. E1: 8123, 83, 424, 1370. E2: 8330, 2, 622, 1046. E3: 9481, 94, 127, 298. E4: 3954, 3080, 5, 2961. E5: 2886, 1363, 1320, 4431. E6: 1500, 2000, 500, 6000. E7: 4000, 2000, 1000, 3000. E8: 4000, 2000, 2000, 2000. E9: 1720, 7121, 5, 1154. E10: 61, 2483, 4, 7452. 154

Property under Variable Permutation [The 2x2 contingency table with cells p, q, r, s and its transpose.] Does M(A,B) = M(B,A)? Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc. Asymmetric measures: confidence, conviction, Laplace, J-measure, etc. 155

Property under Row/Column Scaling Grade-Gender Example (Mosteller, 1968): [Table 1: High: Male 2, Female 3 (total 5); Low: Male 1, Female 4 (total 5); column totals 3, 7, grand total 10. Table 2, after scaling the Male column by 2x and the Female column by 10x: High: 4, 30 (34); Low: 2, 40 (42); column totals 6, 70, grand total 76.] Mosteller: the underlying association should be independent of the relative number of male and female students in the samples 156

Example: φ-coefficient The φ-coefficient is analogous to the correlation coefficient for continuous variables. [Table 1: X and Y = 60, X and not-Y = 10 (total 70); not-X and Y = 10, not-X and not-Y = 20 (total 30); column totals 70, 30, N = 100.] φ = (0.6 - 0.7*0.7) / sqrt(0.7*0.3*0.7*0.3) = 0.5238. [Table 2: X and Y = 20, X and not-Y = 10 (total 30); not-X and Y = 10, not-X and not-Y = 60 (total 70); column totals 30, 70, N = 100.] φ = (0.2 - 0.3*0.3) / sqrt(0.3*0.7*0.3*0.7) = 0.5238. The φ coefficient is the same for both tables 157

Property under Null Addition [The 2x2 contingency table with cells p, q, r, s, and the same table after adding k transactions that contain neither A nor B (s becomes s + k).] Invariant measures: support, cosine, Jaccard, etc. Non-invariant measures: correlation, Gini, mutual information, odds ratio, etc. 158

Different Measures have Different Properties [Table (after Tan et al.): for each of about twenty measures - φ-correlation, lambda, odds ratio, Yule's Q, Yule's Y, Cohen's kappa, mutual information, J-measure, Gini index, support, confidence, Laplace, conviction, interest, IS (cosine), Piatetsky-Shapiro's, certainty factor, added value, collective strength, Jaccard, Klosgen's - the table lists its range and whether it satisfies properties P1-P3 and O1-O4 (e.g. symmetry under variable permutation, row/column scaling invariance, null addition invariance). Different measures satisfy different subsets of the properties.] 159

Support-based Pruning Most of the association rule mining algorithms use support measure to prune rules and itemsets Study effect of support pruning on correlation of itemsets Generate 10000 random contingency tables Compute support and pairwise correlation for each table Apply support-based pruning and examine the tables that are removed 160

Effect of Support-based Pruning [Histogram: distribution of the pairwise correlation over all item pairs from the 10,000 random contingency tables.] 161

Effect of Support-based Pruning [Histograms of the correlation distribution for item pairs with support < 0.01, support < 0.03 and support < 0.05.] Support-based pruning eliminates mostly negatively correlated itemsets 162

Effect of Support-based Pruning Investigate how support-based pruning affects other measures Steps: Generate 10000 contingency tables Rank each table according to the different measures Compute the pair-wise correlation between the measures 163

Effect of Support-based Pruning Without support pruning (all pairs): [matrix of pairwise correlations between the 21 measures - conviction, odds ratio, collective strength, correlation, interest, PS, CF, Yule Y, reliability, kappa, Klosgen, Yule Q, confidence, Laplace, IS, support, Jaccard, lambda, Gini, J-measure, mutual information; red cells indicate correlation between the pair of measures > 0.85.] 40.14% of the pairs have correlation > 0.85. [Scatter plot between correlation and the Jaccard measure.] 164

Effect of Support-based Pruning With 0.005 <= support <= 0.500 (0.5% to 50%): [same measure-correlation matrix and scatter plot between correlation and the Jaccard measure.] 61.45% of the pairs have correlation > 0.85. 165

Effect of Support-based Pruning With 0.005 <= support <= 0.300 (0.5% to 30%): [same measure-correlation matrix and scatter plot between correlation and the Jaccard measure.] 76.42% of the pairs have correlation > 0.85. 166

Rule interestingness measures based on the statistical strength of the patterns More than 50 measures have been called rule interestingness measures in the literature The vast majority of these measures are based only on statistical strength or mathematical properties (Hilderman & Hamilton 2001), (Tan et al. 2002) Are they correlated with real human interest??

Evaluating the correlation between data-driven measures and real human interest 3-step methodology: (1) rank the discovered rules according to each of a number of data-driven rule interestingness measures; (2) show (a subset of) the discovered rules to the user, who evaluates how interesting each rule is; (3) measure the linear correlation between the ranking of each data-driven measure and real human interest. Rules are evaluated by an expert user, who also provided the dataset for the data miner

The Study (Carvalho et al. 2005) 8 datasets, one user for each dataset; 11 data-driven rule interestingness measures. Each data-driven measure was evaluated over 72 <user, rule> pairs (8 users x 9 rules/dataset). 88 correlation values (8 users x 11 measures); 31 correlation values (35%) were >= 0.6. The correlations associated with each measure varied considerably across the 8 datasets/users

Data-driven approaches based on finding specific kinds of surprising patterns Principle: a few special kinds of patterns tend to be surprising to most users, regardless of the meaning of the attributes in the pattern and the application domain. Does not require domain knowledge or beliefs specified by the user. We discuss two approaches: discovery of exception rules contradicting general rules, and detection of Simpson's paradox

Selecting Surprising Rules Which Are Exceptions of a General Rule Data-driven approach: consider the following 2 rules R 1 : IF Cond 1 THEN Class 1 (general rule) R 2 : IF Cond 1 AND Cond 2 THEN Class 2 (exception) Cond 1, Cond 2 are conjunctions of conditions Rule R 2 is a specialization of rule R 1 Rules R 1 and R 2 predict different classes An exception rule is interesting if both the rule and its general rule have a high predictive accuracy (Suzuki & Kodratoff 1998), (Suzuki 2004)

Selecting Surprising Exception Rules an Example Mining data about car accidents (Suzuki 2004) IF (used_seat_belt = yes) THEN (injury = no) (general rule) IF (used_seat_belt = yes) AND (passenger = child) THEN (injury = yes) (exception rule) If both rules are accurate, then the exception rule would be considered interesting (it is both accurate and surprising)

Simpson's Paradox What's the confidence of the following rules: (rule 1) {HDTV=Yes} -> {Exercise machine = Yes}; (rule 2) {HDTV=No} -> {Exercise machine = Yes}? Confidence of rule 1 = 99/180 = 55%. Confidence of rule 2 = 54/120 = 45%. So, customers who buy high-definition televisions are more likely to buy exercise machines than those who don't buy high-definition televisions. Right? Well, maybe not...

Stratification: Simpson's paradox Consider this more detailed table: what's the confidence of the rules for each stratum: (rule 1) {HDTV=Yes} -> {Exercise machine = Yes}; (rule 2) {HDTV=No} -> {Exercise machine = Yes}? College students: confidence of rule 1 = 1/10 = 10%; confidence of rule 2 = 4/34 = 11.8%. Working adults: confidence of rule 1 = 98/170 = 57.7%; confidence of rule 2 = 50/86 = 58.1%. The rules suggest that, for each group, customers who don't buy HDTVs are more likely to buy exercise machines, which contradicts the previous conclusion when data from the two customer groups are pooled together.
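A small sketch (mine, not from the slides) recomputes the stratified and pooled confidences above, making the reversal explicit:

```python
# Sketch: the stratified counts above, showing the confidence reversal
# (Simpson's paradox) when the two groups are pooled.
groups = {
    "college students": {"hdtv_yes": (1, 10), "hdtv_no": (4, 34)},
    "working adults":   {"hdtv_yes": (98, 170), "hdtv_no": (50, 86)},
}
pooled = {"hdtv_yes": [0, 0], "hdtv_no": [0, 0]}
for name, g in groups.items():
    for key, (buy, total) in g.items():
        pooled[key][0] += buy
        pooled[key][1] += total
        print(f"{name:16s} {key}: confidence = {buy}/{total} = {buy/total:.3f}")
for key, (buy, total) in pooled.items():
    print(f"pooled           {key}: confidence = {buy}/{total} = {buy/total:.3f}")
```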

Importance of Stratification The lesson here is that proper stratification is needed to avoid generating spurious patterns resulting from Simpson's paradox. For example Market basket data from a major supermarket chain should be stratified according to store locations, while Medical records from various patients should be stratified according to confounding factors such as age and gender.

Effect of Support Distribution Many real data sets have skewed support distribution where most of the items have relatively low to moderate frequencies, but a small number of them have very high frequencies.

Skewed distribution It is tricky to choose the right support threshold for mining such data sets. If we set the threshold too high (e.g., 20%), then we may miss many interesting patterns involving the items from the low-support group (G1). Such low-support items may correspond to expensive products (such as jewelry) that are seldom bought by customers, but whose patterns are still interesting to retailers. Conversely, when the threshold is set too low, there is the risk of generating spurious patterns that relate a high-frequency item such as milk to a low-frequency item such as caviar.

Cross-support patterns Cross-support patterns are those that relate a high-frequency item such as milk to a low-frequency item such as caviar. They are likely to be spurious because their correlations tend to be weak. E.g. the confidence of {caviar} -> {milk} is likely to be high, but the pattern is still spurious, since there probably isn't any correlation between caviar and milk. However, we don't want to use lift during the computation of frequent itemsets because it doesn't have the anti-monotone property; lift is rather used as a post-processing step. So, we want to detect cross-support patterns by looking at some anti-monotone property.

Cross-support patterns Definition: a cross-support pattern is an itemset X = {i1, i2, ..., ik} whose support ratio r(X) = min{ s(i1), s(i2), ..., s(ik) } / max{ s(i1), s(i2), ..., s(ik) } is less than a user-specified threshold hc. Example: suppose the support for milk is 70%, while the support for sugar is 10% and caviar is 0.04%. Given hc = 0.01, the frequent itemset {milk, sugar, caviar} is a cross-support pattern because its support ratio is r = min{0.7, 0.1, 0.0004} / max{0.7, 0.1, 0.0004} = 0.0004 / 0.7 = 0.00058 < 0.01
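For illustration, a few lines of Python (not from the slides) computing the support ratio for the milk/sugar/caviar example:

```python
# Sketch: support ratio r(X) from the milk/sugar/caviar example.
supports = {"milk": 0.70, "sugar": 0.10, "caviar": 0.0004}

def support_ratio(itemset, supports):
    vals = [supports[i] for i in itemset]
    return min(vals) / max(vals)

r = support_ratio({"milk", "sugar", "caviar"}, supports)
print(r, r < 0.01)   # ~0.00057, True -> cross-support pattern
```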

Detecting cross-support patterns E.g., assuming that hc = 0.3, the itemsets {p,q}, {p,r}, and {p,q,r} are cross-support patterns, because their support ratios, being equal to 0.2, are less than the threshold hc. We can apply a high support threshold, say 20%, to eliminate the cross-support patterns, but this may come at the expense of discarding other interesting patterns, such as the strongly correlated itemset {q,r} that has support equal to 16.7%.

Detecting cross-support patterns Confidence pruning also doesn't help. The confidence of {q} -> {p} is 80% even though {p, q} is a cross-support pattern. Meanwhile, rule {q} -> {r} also has high confidence even though {q, r} is not a cross-support pattern. This demonstrates the difficulty of using the confidence measure to distinguish between rules extracted from cross-support and non-cross-support patterns. 181

Lowest confidence rule Notice that the rule {p} -> {q} has very low confidence, because most of the transactions that contain p do not contain q. This observation suggests that cross-support patterns can be detected by examining the lowest-confidence rule that can be extracted from a given itemset.

Finding lowest confidence Recall the anti-monotone property of confidence: conf({i1, i2} -> {i3, i4, ..., ik}) <= conf({i1, i2, i3} -> {i4, ..., ik}). This property suggests that confidence never increases as we shift more items from the left-hand to the right-hand side of an association rule. Hence, the lowest-confidence rule that can be extracted from a frequent itemset contains only one item on its left-hand side.

Finding lowest confidence Given a frequent itemset {i1, i2, ..., ik}, the rule {ij} -> {i1, ..., ij-1, ij+1, ..., ik} has the lowest confidence if s(ij) = max{ s(i1), s(i2), ..., s(ik) }. This follows directly from the definition of confidence as the ratio between the rule's support and the support of the rule antecedent.

Finding lowest confidence Summarizing, the lowest confidence attainable from a frequent itemset {i1, i2, ..., ik} is s({i1, i2, ..., ik}) / max{ s(i1), s(i2), ..., s(ik) }. This is also known as the h-confidence measure or all-confidence measure.
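A minimal sketch of the measure; the 0.05 support used for the example itemset below is a hypothetical number for illustration, not from the slides.

```python
# Sketch: h-confidence (all-confidence) of an itemset.
def h_confidence(itemset_support, item_supports):
    """itemset_support: s(X); item_supports: supports of the individual items of X."""
    return itemset_support / max(item_supports)

# Hypothetical example: s({milk, sugar}) = 0.05, s(milk) = 0.70, s(sugar) = 0.10
print(h_confidence(0.05, [0.70, 0.10]))   # ~0.071: a low value flags a possible cross-support pattern
```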

h-confidence Clearly, cross-support patterns can be eliminated by ensuring that the h-confidence values for the patterns exceed some threshold hc. Observe that the measure is also anti-monotone, i.e., h-confidence({i1, i2, ..., ik}) >= h-confidence({i1, i2, ..., ik+1}), and thus can be incorporated directly into the mining algorithm.

User Driven Approach Subjective Approach 187

Subjective Interestingness Measure Objective measure: rank patterns based on statistics computed from data. Subjective measure: rank patterns according to the user's interpretation. A pattern is subjectively interesting if it contradicts the expectation of a user (Silberschatz & Tuzhilin). A pattern is subjectively interesting if it is actionable (Silberschatz & Tuzhilin) 188

Interestingness via Unexpectedness Need to model the expectation of users (domain knowledge). [Diagram: patterns expected to be frequent (+) or infrequent (-) are compared with patterns found to be frequent or infrequent; agreements are expected patterns, disagreements are unexpected patterns.] Need to combine the expectation of users with evidence from data (i.e., extracted patterns) 189

Selecting Interesting Association Rules based on User-Defined Templates User-driven approach: the user specifies templates for interesting and non-interesting rules Inclusive templates specify interesting rules Restrictive templates specify non-interesting rules A rule is considered interesting if it matches at least one inclusive template and it matches no restrictive template (Klemettinen et al. 1994)

Selecting Interesting Association Rules Hierarchies of items and their categories: food (rice, fish, hot dog), drink (coke, wine, water), clothes (trousers, socks). Inclusive template: IF (food) THEN (drink). Restrictive template: IF (hot dog) THEN (coke). The rule IF (fish) THEN (wine) is interesting

Selecting Surprising Rules based on General Impressions User-driven approach: the user specifies her/his general impressions about the application domain (Liu et al. 1997, 2000); (Romao et al. 2004) Example: IF (salary = high) THEN (credit = good) (different from reasonably precise knowledge, e.g. IF salary > 35,500 ) Discovered rules are matched with the user s general impressions in a post-processing step, in order to determine the most surprising (unexpected) rules

Selecting Surprising Rules based on General Impressions Case I rule with unexpected consequent A rule and a general impression have similar antecedents but different consequents Case II rule with unexpected antecedent A rule and a general impression have similar consequents but the rule antecedent is different from the antecedent of all general impressions In both cases the rule is considered unexpected

Selecting Surprising Rules based on General Impressions GENERAL IMPRESSION: IF (salary = high) AND (education_level = high) THEN (credit = good). UNEXPECTED RULES: IF (salary > 30,000) AND (education >= BSc) AND (mortgage = yes) THEN (credit = bad) /* Case I - unexpected consequent */; IF (mortgage = no) AND (C/A_balance > 10,000) THEN (credit = good) /* Case II - unexpected antecedent */ /* assuming no other general impression has a similar antecedent */

Summary Association rule induction is a two-step process: find the frequent item sets (minimum support); form the relevant association rules (minimum confidence). Generating the association rules: form all possible association rules from the frequent item sets; filter the "interesting" association rules 195

Readings Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Chapter 6. R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94. H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94. J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00. M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94. S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. SIGMOD'97. E. Omiecinski. Alternative interest measures for mining associations. TKDE'03. Charu C. Aggarwal and Jiawei Han. 2014. Frequent Pattern Mining. Springer Publishing Company, Incorporated. 196