CSE 5243 INTRO. TO DATA MINING
Mining Frequent Patterns and Associations: Basic Concepts (Chapter 6)
Huan Sun, CSE@The Ohio State University, 10/17/2017
Slides adapted from Prof. Jiawei Han @UIUC and Prof. Srinivasan Parthasarathy @OSU

Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
- Basic Concepts
- Efficient Pattern Mining Methods
- Pattern Evaluation
- Summary

Pattern Discovery: Basic Concepts
- What Is Pattern Discovery? Why Is It Important?
- Basic Concepts: Frequent Patterns and Association Rules
- Compressed Representation: Closed Patterns and Max-Patterns

What Is Pattern Discovery?
Motivation examples:
- What products were often purchased together?
- What are the subsequent purchases after buying an iPad?
- What code segments likely contain copy-and-paste bugs?
- What word sequences likely form phrases in this corpus?
What are patterns?
- Patterns: sets of items, subsequences, or substructures that occur frequently together (or are strongly correlated) in a data set
- Patterns represent intrinsic and important properties of data sets
Pattern discovery: uncovering patterns from massive data sets

Pattern Discovery: Why Is It Important?
Finding inherent regularities in a data set is the foundation for many essential data mining tasks:
- Association, correlation, and causality analysis
- Mining sequential and structural (e.g., sub-graph) patterns
- Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
- Classification: discriminative pattern-based analysis
- Cluster analysis: pattern-based subspace clustering
Broad applications: market basket analysis, cross-marketing, catalog design, sales campaign analysis, Web log analysis, biological sequence analysis

Basic Concepts: k-Itemsets and Their Supports
- Itemset: a set of one or more items
- k-itemset: X = {x1, ..., xk}; e.g., {Beer, Nuts, Diaper} is a 3-itemset
- (Absolute) support (count) of X, sup{X}: the number of occurrences of itemset X
  Ex. sup{Beer} = 3; sup{Diaper} = 4; sup{Beer, Diaper} = 3; sup{Beer, Eggs} = 1
- (Relative) support of X, s{X}: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
  Ex. s{Beer} = 3/5 = 60%; s{Diaper} = 4/5 = 80%; s{Beer, Eggs} = 1/5 = 20%

Tid    Items bought
10     Beer, Nuts, Diaper
20     Beer, Coffee, Diaper
30     Beer, Diaper, Eggs
40     Nuts, Eggs, Milk
50     Nuts, Coffee, Diaper, Eggs, Milk
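To make the definitions concrete, here is a minimal Python sketch (my own illustration, not part of the slides; function names are mine) that computes absolute and relative support over the example table above:

    # The 5-transaction example database from the slides
    transactions = [
        {"Beer", "Nuts", "Diaper"},
        {"Beer", "Coffee", "Diaper"},
        {"Beer", "Diaper", "Eggs"},
        {"Nuts", "Eggs", "Milk"},
        {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
    ]

    def support_count(itemset, db):
        # Absolute support: number of transactions containing every item of the itemset
        return sum(1 for t in db if set(itemset) <= t)

    def support(itemset, db):
        # Relative support: fraction of transactions containing the itemset
        return support_count(itemset, db) / len(db)

    print(support_count({"Beer", "Diaper"}, transactions))  # 3
    print(support({"Beer", "Eggs"}, transactions))          # 0.2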

Basic Concepts: Frequent Itemsets (Patterns)
An itemset (or pattern) X is frequent if the support of X is no less than a minsup threshold σ.
Let σ = 50% for the 5-transaction dataset above:
- All frequent 1-itemsets: Beer: 3/5 (60%); Nuts: 3/5 (60%); Diaper: 4/5 (80%); Eggs: 3/5 (60%)
- All frequent 2-itemsets: {Beer, Diaper}: 3/5 (60%)
- All frequent 3-itemsets? None
Why do these itemsets form the complete set of frequent k-itemsets (patterns) for any k?
Observation: we need an efficient method to mine the complete set of frequent patterns; a naive enumeration (sketched below) quickly becomes infeasible.
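For contrast with the efficient methods introduced later, here is a naive brute-force miner (my illustration, not from the slides) that tests every itemset over the item universe; its exponential candidate space is exactly why better algorithms are needed. It reuses the transactions list from the previous sketch:

    from itertools import combinations

    def naive_frequent_itemsets(db, minsup_count):
        # Test every candidate itemset over the item universe against minsup
        items = sorted(set().union(*db))
        frequent = {}
        for k in range(1, len(items) + 1):
            for cand in combinations(items, k):
                count = sum(1 for t in db if set(cand) <= t)
                if count >= minsup_count:
                    frequent[cand] = count
        return frequent

    print(naive_frequent_itemsets(transactions, 3))
    # {('Beer',): 3, ('Diaper',): 4, ('Eggs',): 3, ('Nuts',): 3, ('Beer', 'Diaper'): 3}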

From Frequent Itemsets to Association Rules
Compared with itemsets, rules can be more telling. Ex. Diaper → Beer: buying diapers likely leads to buying beer.
How strong is this rule? Measure an association rule X → Y by (s, c), where both X and Y are itemsets:
- Support, s: the probability that a transaction contains X ∪ Y
  Ex. s{Diaper, Beer} = 3/5 = 0.6 (i.e., 60%)
- Confidence, c: the conditional probability that a transaction containing X also contains Y, i.e., c = sup(X ∪ Y) / sup(X)
  Ex. c = sup{Diaper, Beer} / sup{Diaper} = 3/4 = 0.75
Note that {Beer} ∪ {Diaper} = {Beer, Diaper}: the rule's support counts the transactions containing both beer and diapers.
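A small sketch of these two formulas (reusing the transactions list defined above; the function name is mine):

    def rule_support_confidence(X, Y, db):
        # Support s = P(X ∪ Y); confidence c = sup(X ∪ Y) / sup(X)
        sup_xy = sum(1 for t in db if X | Y <= t)
        sup_x = sum(1 for t in db if X <= t)
        return sup_xy / len(db), sup_xy / sup_x

    print(rule_support_confidence({"Diaper"}, {"Beer"}, transactions))  # (0.6, 0.75)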

Mining Frequent Itemsets and Association Rules
Association rule mining: given two thresholds, minsup and minconf, find all rules X → Y (s, c) such that s ≥ minsup and c ≥ minconf.
For the 5-transaction dataset, let minsup = 50% and minconf = 50%:
- Frequent 1-itemsets: Beer: 3, Nuts: 3, Diaper: 4, Eggs: 3
- Frequent 2-itemsets: {Beer, Diaper}: 3
- Rules: Beer → Diaper (60%, 100%); Diaper → Beer (60%, 75%)
(Q: Are these all the rules?)
Observations: mining association rules and mining frequent patterns are closely related problems, and scalable methods are needed for mining large datasets. A small rule-generation sketch follows.
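Given the frequent itemsets and their counts, rule generation is a post-processing step. Here is a minimal sketch under that assumption (the function name and the freq table are mine; it relies on downward closure, so every antecedent is present in the table):

    from itertools import combinations

    def generate_rules(frequent, n_transactions, minconf):
        # frequent maps frozenset itemsets to absolute support counts
        rules = []
        for itemset, count in frequent.items():
            if len(itemset) < 2:
                continue
            for r in range(1, len(itemset)):
                for X in map(frozenset, combinations(itemset, r)):
                    Y = itemset - X
                    conf = count / frequent[X]   # sup(X ∪ Y) / sup(X)
                    if conf >= minconf:
                        rules.append((X, Y, count / n_transactions, conf))
        return rules

    freq = {frozenset({"Beer"}): 3, frozenset({"Diaper"}): 4,
            frozenset({"Nuts"}): 3, frozenset({"Eggs"}): 3,
            frozenset({"Beer", "Diaper"}): 3}
    for X, Y, s, c in generate_rules(freq, 5, 0.5):
        print(set(X), "->", set(Y), f"s={s:.0%} c={c:.0%}")
    # {'Beer'} -> {'Diaper'} s=60% c=100%
    # {'Diaper'} -> {'Beer'} s=60% c=75%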

Challenge: There Are Too Many Frequent Patterns!
A long pattern contains a combinatorial number of sub-patterns. How many frequent itemsets does the following TDB_1 contain?
TDB_1: T1: {a1, ..., a50}; T2: {a1, ..., a100}
Assuming (absolute) minsup = 1, let's have a try:
1-itemsets: {a1}: 2, {a2}: 2, ..., {a50}: 2, {a51}: 1, ..., {a100}: 1
2-itemsets: {a1, a2}: 2, ..., {a1, a50}: 2, {a1, a51}: 1, ..., {a99, a100}: 1
...
99-itemsets: {a1, a2, ..., a99}: 1, ..., {a2, a3, ..., a100}: 1
100-itemset: {a1, a2, ..., a100}: 1
In total: C(100, 1) + C(100, 2) + ... + C(100, 100) = 2^100 − 1 frequent itemsets, far too huge a set for anyone to compute or store!

Expressing Patterns in Compressed Form: Closed Patterns
How to handle such a challenge?
Solution 1: Closed patterns. A pattern (itemset) X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X.
Let the transaction DB be TDB_1: T1: {a1, ..., a50}; T2: {a1, ..., a100}, and suppose minsup = 1. How many closed patterns does TDB_1 contain? Two: P1: {a1, ..., a50}: 2 and P2: {a1, ..., a100}: 1.
A closed pattern is a lossless compression of the frequent patterns: it reduces the number of patterns without losing the support information. You can still derive, e.g., {a2, ..., a40}: 2 and {a5, a51}: 1.

Expressing Patterns in Compressed Form: Max-Patterns
Solution 2: Max-patterns. A pattern X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X.
Difference from closed patterns: a max-pattern does not preserve the real support of its sub-patterns.
Let the transaction DB be TDB_1: T1: {a1, ..., a50}; T2: {a1, ..., a100}, and suppose minsup = 1. How many max-patterns does TDB_1 contain? One: P: {a1, ..., a100}: 1.
A max-pattern is a lossy compression! We only know that {a1, ..., a40} is frequent, but we no longer know its real support. Thus, in many applications, closed patterns are more desirable than max-patterns. A sketch contrasting the two representations follows.
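As a sanity check on the definitions, here is a small Python sketch (mine, not from the slides) that derives the closed and the maximal itemsets from a full table of frequent itemsets; it reuses the freq dictionary from the rule-generation sketch earlier:

    def closed_and_maximal(frequent):
        # frequent maps frozenset itemsets to absolute support counts
        closed, maximal = {}, {}
        for X, sup in frequent.items():
            supersets = [Y for Y in frequent if X < Y]
            if not any(frequent[Y] == sup for Y in supersets):
                closed[X] = sup       # no frequent superset with the same support
            if not supersets:
                maximal[X] = sup      # no frequent superset at all
        return closed, maximal

    closed, maximal = closed_and_maximal(freq)
    # closed : {Diaper}:4, {Nuts}:3, {Eggs}:3, {Beer, Diaper}:3
    #          ({Beer}:3 is absorbed by {Beer, Diaper}:3, which has the same support)
    # maximal: {Nuts}:3, {Eggs}:3, {Beer, Diaper}:3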

Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
- Basic Concepts
- Efficient Pattern Mining Methods
  - The Apriori Algorithm
  - Application in Classification
- Pattern Evaluation
- Summary

Efficient Pattern Mining Methods
- The Downward Closure Property of Frequent Patterns
- The Apriori Algorithm
- Extensions or Improvements of Apriori
- Mining Frequent Patterns by Exploring Vertical Data Format
- FPGrowth: A Frequent Pattern-Growth Approach
- Mining Closed Patterns

The Downward Closure Property of Frequent Patterns
Observation: from TDB_1 (T1: {a1, ..., a50}; T2: {a1, ..., a100}) we get the frequent itemset {a1, ..., a50}, and all of its subsets are frequent too: {a1}, {a2}, ..., {a50}, {a1, a2}, ..., {a1, ..., a49}, ... There must be some hidden relationship among frequent patterns!
The downward closure (also called "Apriori") property of frequent patterns: if {beer, diaper, nuts} is frequent, so is {beer, diaper}, since every transaction containing {beer, diaper, nuts} also contains {beer, diaper}.
Apriori: any subset of a frequent itemset must be frequent. This gives an efficient mining methodology, a sharp knife for pruning: if any subset of an itemset S is infrequent, then S has no chance of being frequent, so why even consider S!?

Apriori Pruning and Scalable Mining Methods
Apriori pruning principle: if any itemset is infrequent, its supersets should not even be generated! (Agrawal & Srikant @VLDB'94; Mannila et al. @KDD'94)
Scalable mining methods: three major approaches
- Level-wise, join-based approach: Apriori (Agrawal & Srikant @VLDB'94)
- Vertical data format approach: Eclat (Zaki, Parthasarathy, Ogihara & Li @KDD'97)
- Frequent pattern projection and growth: FPgrowth (Han, Pei & Yin @SIGMOD'00)

Apriori: A Candidate Generation & Test Approach
Outline of Apriori (level-wise, candidate generation and test):
1. Initially, scan the DB once to get the frequent 1-itemsets
2. Repeat:
   - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
   - Test the candidates against the DB to find the frequent (k+1)-itemsets
   - Set k := k + 1
3. Until no frequent or candidate set can be generated
4. Return all the frequent itemsets derived

The Apriori Algorithm (Pseudo-Code)
C_k: candidate itemsets of size k
F_k: frequent itemsets of size k

    k := 1
    F_k := {frequent items}                       // the frequent 1-itemsets
    while (F_k != ∅) do {                         // while F_k is non-empty
        C_{k+1} := candidates generated from F_k  // candidate generation
        derive F_{k+1} by counting the candidates in C_{k+1} against the TDB at minsup
        k := k + 1
    }
    return ∪_k F_k                                // the union of the F_k found at each level
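The pseudo-code can be turned into a short runnable program. Below is a minimal Python sketch (my own illustration, not the course's reference code; function and variable names are mine) that also folds in the self-join and subset-pruning steps described on the following slides, and reproduces the trace of the example that comes next:

    from itertools import combinations

    def apriori(db, minsup_count):
        # F1: frequent 1-itemsets
        items = sorted(set().union(*db))
        F_k = {}
        for i in items:
            count = sum(1 for t in db if i in t)
            if count >= minsup_count:
                F_k[frozenset([i])] = count
        frequent = {}
        k = 1
        while F_k:
            frequent.update(F_k)
            # Candidate generation: self-join F_k, prune by downward closure
            candidates = set()
            keys = list(F_k)
            for a in keys:
                for b in keys:
                    u = a | b
                    if len(u) == k + 1 and all(
                        frozenset(s) in F_k for s in combinations(u, k)
                    ):
                        candidates.add(u)
            # Test the surviving candidates against the DB
            F_k = {}
            for c in candidates:
                count = sum(1 for t in db if c <= t)
                if count >= minsup_count:
                    F_k[c] = count
            k += 1
        return frequent

    tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    print(apriori(tdb, 2))
    # F1: A:2, B:3, C:3, E:3; F2: AC:2, BC:2, BE:3, CE:2; F3: BCE:2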

The Apriori Algorithm: An Example (minsup = 2)

Database TDB:
Tid    Items
10     A, C, D
20     B, C, E
30     A, B, C, E
40     B, E

1st scan: C1 = {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3  →  F1 = {A}: 2, {B}: 3, {C}: 3, {E}: 3
C2 (generated from F1) = {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan: {A,B}: 1, {A,C}: 2, {A,E}: 1, {B,C}: 2, {B,E}: 3, {C,E}: 2  →  F2 = {A,C}: 2, {B,C}: 2, {B,E}: 3, {C,E}: 2
C3 (generated from F2) = {B,C,E}
3rd scan: {B,C,E}: 2  →  F3 = {B,C,E}: 2

Apriori: Implementation Tricks
How to generate candidates?
- Step 1: self-join F_k
- Step 2: pruning
Example of candidate generation with F3 = {abc, abd, acd, ace, bcd}:
- Self-joining F3 * F3: abcd from abc and abd; acde from acd and ace
- Pruning: acde is removed because its subset ade is not in F3
- Result: C4 = {abcd}

Candidate Generation: An SQL Implementation
Suppose the items in F_{k-1} are listed in a fixed order.

Step 1: self-joining F_{k-1}

    insert into C_k
    select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
    from F_{k-1} p, F_{k-1} q
    where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

Step 2: pruning

    for all itemsets c in C_k do
        for all (k-1)-subsets s of c do
            if (s is not in F_{k-1}) then delete c from C_k

Apriori: Advantages and Disadvantages
Advantages:
- Uses the large-itemset property
- Easily parallelized
- Easy to implement
Disadvantages:
- Assumes the transaction database is memory-resident
- Requires up to m database scans

Classification Based on Association Rules (CBA)
Why?
- Can effectively uncover the correlation structure in data
- Association rules are typically quite scalable in practice
- Rules are often very intuitive; hence a classifier built on intuitive rules is easier to interpret
When to use?
- On large, dynamic datasets where class labels are available and the correlation structure is unknown
- Multi-class categorization problems, e.g., Web/text categorization, network intrusion detection

Example: Text Categorization
Input: <feature vector> <class label(s)>, where <feature vector> = w1, ..., wn and <class label(s)> = c1, ..., cm
- Run association rule mining with minsup and minconf
- Prune rules of the form w1 → w2, [w1, c2] → c3, etc.
- Keep only rules satisfying the constraint W → C (LHS composed only of w1, ..., wn; RHS composed only of c1, ..., cm), as in the sketch below
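A tiny hypothetical sketch of this filtering step (the rule representation as (X, Y, s, c) tuples follows the earlier rule-generation sketch; the function name is mine):

    def keep_word_to_class_rules(rules, words, classes):
        # Keep only rules W -> C: antecedent all word features, consequent all class labels
        return [(X, Y, s, c) for (X, Y, s, c) in rules
                if X <= words and Y <= classes]

    # e.g., a rule {"w1"} -> {"w2"} or {"w1", "c2"} -> {"c3"} is dropped.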

CBA: Text Categorization (cont.)
Order the remaining rules by confidence, and within each confidence level by support:
- Confidence 100%: R1: W1 → C1 (support 40%); R2: W4 → C2 (support 60%)
- Confidence 95%: R3: W3 → C2 (support 30%); R4: W5 → C4 (support 70%)
Resulting ordering: R2, R1, R4, R3

CBA: Text Categorization (cont.)
Take the training data and evaluate the predictive ability of each rule; prune away rules that are subsumed by superior rules. Training transactions (note: only a subset of the training data):
T1: W1 W5 → C1, C4
T2: W2 W4 → C2
T3: W3 W4 → C2
T4: W5 W8 → C4
T5: W9 → C2
Rule R3 would be pruned in this example if it is always subsumed by rule R2.
For the remaining transactions, pick the most dominant class as the default: T5 is not covered, so C2 is picked in this example.

Formal Concepts of the Model
Given two rules r_i and r_j, define r_i ≻ r_j if:
- the confidence of r_i is greater than that of r_j, or
- their confidences are the same, but the support of r_i is greater than that of r_j, or
- both the confidences and the supports are the same, but r_i was generated earlier than r_j.
Our classifier model has the format <r_1, r_2, ..., r_n, default_class>, where r_i ∈ R and r_a ≻ r_b whenever b > a.
Other models are possible, e.g., sorting by the length of the antecedent.

Using the CBA Model to Classify
For a new transaction {W1, W3, W5}:
- Pick the k most confident rules that apply (using the precedence ordering established in the baseline model)
- The resulting classes are the predictions for this transaction
  - If k = 1, you would pick C1
  - If k = 2, you would pick C1, C2 (multi-class)
- Similarly, for {W9, W10} you would pick C2 (the default)
Accuracy is measured as before (classification error). A sketch of this procedure follows.
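To make the precedence ordering and the k-rule prediction concrete, here is a hypothetical Python sketch (the Rule structure, its field names, and the example numbers are illustrative; confidences and supports follow the R1-R4 example above):

    from typing import NamedTuple

    class Rule(NamedTuple):
        antecedent: frozenset   # feature itemset W
        consequent: str         # class label C
        support: float
        confidence: float
        gen_order: int          # generation order, the final tie-breaker

    def precedence_key(r):
        # Implements r_i > r_j: higher confidence, then higher support, then earlier generation
        return (-r.confidence, -r.support, r.gen_order)

    def classify(transaction, rules, k, default_class):
        # Predict with the k most confident applicable rules, or fall back to the default
        applicable = [r for r in sorted(rules, key=precedence_key)
                      if r.antecedent <= transaction]
        if not applicable:
            return [default_class]
        return [r.consequent for r in applicable[:k]]

    rules = [
        Rule(frozenset({"W1"}), "C1", 0.40, 1.00, 1),  # R1
        Rule(frozenset({"W4"}), "C2", 0.60, 1.00, 2),  # R2
        Rule(frozenset({"W3"}), "C2", 0.30, 0.95, 3),  # R3
        Rule(frozenset({"W5"}), "C4", 0.70, 0.95, 4),  # R4
    ]
    print(classify({"W1", "W3", "W5"}, rules, k=1, default_class="C2"))  # ['C1']
    print(classify({"W9", "W10"}, rules, k=1, default_class="C2"))      # ['C2']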

CBA: Procedural Steps
1. Preprocessing; split the data into training and testing sets
2. Compute association rules on the training data; keep only rules of the form X → C, where C is a class-label itemset and X is a feature itemset
3. Order the rules by confidence, and by support within each confidence level
4. Prune away rules that lack sufficient predictive ability on the training data (starting top-down), using rule subsumption
5. For data that is not predictable, pick the most dominant class as the default class
6. Test on the testing data and report accuracy

Backup Slides

Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
- Basic Concepts
- Efficient Pattern Mining Methods
- Pattern Evaluation
- Summary

Summary
Basic Concepts
- What Is Pattern Discovery? Why Is It Important?
- Basic Concepts: Frequent Patterns and Association Rules
- Compressed Representation: Closed Patterns and Max-Patterns
Efficient Pattern Mining Methods
- The Downward Closure Property of Frequent Patterns
- The Apriori Algorithm
- Extensions or Improvements of Apriori
- Mining Frequent Patterns by Exploring Vertical Data Format
- FPGrowth: A Frequent Pattern-Growth Approach
- Mining Closed Patterns
Pattern Evaluation
- Interestingness Measures in Pattern Mining
- Interestingness Measures: Lift and χ²
- Null-Invariant Measures
- Comparison of Interestingness Measures

Recommended Readings (Basic Concepts)
- R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93
- R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98
- N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99
- J. Han, H. Cheng, D. Xin, and X. Yan. Frequent pattern mining: current status and future directions. Data Mining and Knowledge Discovery, 15(1):55-86, 2007

Recommended Readings (Efficient Pattern Mining Methods)
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95
- J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95
- S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: alternatives and implications. SIGMOD'98
- M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for discovery of association rules. Data Mining and Knowledge Discovery, 1997
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00
- M. J. Zaki and Hsiao. CHARM: an efficient algorithm for closed itemset mining. SDM'02
- J. Wang, J. Han, and J. Pei. CLOSET+: searching for the best strategies for mining frequent closed itemsets. KDD'03
- C. C. Aggarwal, M. A. Bhuiyan, and M. A. Hasan. Frequent pattern mining algorithms: a survey. In Aggarwal and Han (eds.), Frequent Pattern Mining, Springer, 2014

Recommended Readings (Pattern Evaluation)
- C. C. Aggarwal and P. S. Yu. A new framework for itemset generation. PODS'98
- S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: generalizing association rules to correlations. SIGMOD'97
- M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94
- E. Omiecinski. Alternative interest measures for mining associations. TKDE'03
- P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the right interestingness measure for association patterns. KDD'02
- T. Wu, Y. Chen, and J. Han. Re-examination of interestingness measures in pattern mining: a unified framework. Data Mining and Knowledge Discovery, 21(3):371-397, 2010