Association Rules. Acknowledgements: some parts of these slides are modified from C. Clifton & W. Aref, Purdue University.

Similar documents
Data Mining: Concepts and Techniques (3rd ed.), Chapter 6

Chapter 6. Frequent Pattern Mining: Concepts and Apriori. Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining

FP-growth and PrefixSpan

CSE 5243 INTRO. TO DATA MINING

DATA MINING LECTURE 3. Frequent Itemsets Association Rules

Association Rules. Fundamentals

CSE 5243 INTRO. TO DATA MINING

Chapters 6 & 7, Frequent Pattern Mining

D B M G Data Base and Data Mining Group of Politecnico di Torino

D B M G. Association Rules. Fundamentals. Fundamentals. Association rules. Association rule mining. Definitions. Rule quality metrics: example

D B M G. Association Rules. Fundamentals. Fundamentals. Elena Baralis, Silvia Chiusano. Politecnico di Torino 1. Definitions.

CS 484 Data Mining. Association Rule Mining 2

Unit II Association Rules

Data Mining Concepts & Techniques

CS 412 Intro. to Data Mining

COMP 5331: Knowledge Discovery and Data Mining

ASSOCIATION ANALYSIS FREQUENT ITEMSETS MINING. Alexandre Termier, LIG

Association Rules Information Retrieval and Data Mining. Prof. Matteo Matteucci

Lecture Notes for Chapter 6. Introduction to Data Mining

DATA MINING LECTURE 4. Frequent Itemsets, Association Rules Evaluation Alternative Algorithms

732A61/TDDD41 Data Mining - Clustering and Association Analysis

CS 584 Data Mining. Association Rule Mining 2

DATA MINING - 1DL360

DATA MINING LECTURE 4. Frequent Itemsets and Association Rules

Association Rule Mining on Web

Data Mining and Analysis: Fundamental Concepts and Algorithms

Course Content. Association Rules Outline. Chapter 6 Objectives. Chapter 6: Mining Association Rules. Dr. Osmar R. Zaïane. University of Alberta 4

Association Rule. Lecturer: Dr. Bo Yuan. LOGO

DATA MINING - 1DL360

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit:

Data Analytics Beyond OLAP. Prof. Yanlei Diao

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

Outline. Fast Algorithms for Mining Association Rules. Applications of Data Mining. Data Mining. Association Rule. Discussion

CS5112: Algorithms and Data Structures for Applications

Association Analysis: Basic Concepts. and Algorithms. Lecture Notes for Chapter 6. Introduction to Data Mining

Knowledge Discovery and Data Mining I

Machine Learning: Pattern Mining

Lecture Notes for Chapter 6. Introduction to Data Mining. (modified by Predrag Radivojac, 2017)

Association Analysis Part 2. FP Growth (Pei et al 2000)

Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar

Basic Data Structures and Algorithms for Data Profiling Felix Naumann

Positive Borders or Negative Borders: How to Make Lossless Generator Based Representations Concise

Frequent Itemset Mining

Association Analysis. Part 1

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

A New Concise and Lossless Representation of Frequent Itemsets Using Generators and A Positive Border

What Is Association Rule Mining?

Association Rule Mining

Introduction to Data Mining

Descriptive data analysis. Example. Example. Example. Example. Data Mining MTAT (6EAP)

Handling a Concept Hierarchy

Association Analysis 1

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05

COMP 5331: Knowledge Discovery and Data Mining

CPT+: A Compact Model for Accurate Sequence Prediction

Association Rule Mining. Pekka Malo, Assist. Prof. (statistics) Aalto BIZ / Department of Information and Service Management

Encyclopedia of Machine Learning Chapter Number Book CopyRight - Year 2010 Frequent Pattern. Given Name Hannu Family Name Toivonen

Sequential Pattern Mining

Descriptive data analysis. Example. E.g. set of items. Example. Example. Data Mining MTAT (6EAP)

Apriori algorithm. Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK. Presentation Lauri Lahti

Frequent Itemset Mining

Descriptive data analysis. E.g. set of items. Example: Items in baskets. Sort or renumber in any way. Example

Data Mining Concepts & Techniques

Approximate counting: count-min data structure. Problem definition

Temporal Data Mining

1 Frequent Pattern Mining

Frequent Itemset Mining

Association Rules. Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION. Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION

Un nouvel algorithme de génération des itemsets fermés fréquents

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York

Assignment 7 (Sol.) Introduction to Data Analytics Prof. Nandan Sudarsanam & Prof. B. Ravindran

Density-Based Clustering

Frequent Itemset Mining

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #11: Frequent Itemsets

10/19/2017 MIST.6060 Business Intelligence and Data Mining 1. Association Rules

The Market-Basket Model. Association Rules. Example. Support. Applications --- (1) Applications --- (2)

Mining Strong Positive and Negative Sequential Patterns

Exercise 1. min_sup = 0.3. Items support Type a 0.5 C b 0.7 C c 0.5 C d 0.9 C e 0.6 F

15 Introduction to Data Mining

CSE 5243 INTRO. TO DATA MINING

OPPA European Social Fund Prague & EU: We invest in your future.

FREQUENT PATTERN SPACE MAINTENANCE: THEORIES & ALGORITHMS

Descriptive data analysis. E.g. set of items. Example: Items in baskets. Sort or renumber in any way. Example item id s

CS738 Class Notes. Steve Revilak

Distributed Mining of Frequent Closed Itemsets: Some Preliminary Results

Discovering Non-Redundant Association Rules using MinMax Approximation Rules

Mining Molecular Fragments: Finding Relevant Substructures of Molecules

DATA MINING - 1DL105, 1DL111

Pattern Space Maintenance for Data Updates. and Interactive Mining

Mining Positive and Negative Fuzzy Association Rules

On Minimal Infrequent Itemset Mining

Association Analysis. Part 2

Unsupervised Learning. k-means Algorithm

On Condensed Representations of Constrained Frequent Patterns

Maintaining Frequent Itemsets over High-Speed Data Streams

Mining Class-Dependent Rules Using the Concept of Generalization/Specialization Hierarchies

Data Warehousing & Data Mining

Theory of Dependence Values

Detecting Functional Dependencies Felix Naumann

Data Warehousing & Data Mining

Transcription:

Association Rules. CS 5331, by Rattikorn Hewett, Texas Tech University. Acknowledgements: some parts of these slides are modified from C. Clifton & W. Aref, Purdue University.

Outline
- Association Rule Mining: basic concepts
- Association Rule Mining algorithms: single-dimensional Boolean associations, multi-level associations, multi-dimensional associations
- Association vs. correlation
- Adding constraints
- Applications/extensions of frequent pattern mining
- Summary

Motivation: finding regularities/patterns in data
- Market basket analysis: What products are often purchased together? What are the subsequent purchases after buying a PC?
- Bioinformatics: What kinds of DNA are sensitive to this new drug?
- Web mining: Which web documents are similar to a given one?

Association Rule Mining
- Finds interesting relationships among data items: associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Form: A → B [support%, confidence%]. E.g., Bread → Milk [support = 2%, confidence = 70%]: in 2% of all transactions, bread and milk are bought together, and 70% of customers who bought bread also bought milk.
- Rules are interesting (or strong) if they satisfy both a minimum support threshold and a minimum confidence threshold.

Importance
- Foundation for many essential data mining tasks: association, correlation, causality; sequential patterns, temporal association, partial periodicity, spatial and multimedia association; associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression).
- Broad applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, DNA sequence analysis, etc.

Basic Concepts
- I = the set of all possible items.
- A transaction T is a set of items, T ⊆ I.
- D = the data set of all transactions.
- Association rule: A → B [support%, confidence%], where A ⊆ I, B ⊆ I, and A ∩ B = ∅.
- Support(A → B): the probability that a transaction contains both A and B = P(A ∪ B) = support-count(A ∪ B) / |D|.
- Confidence(A → B): the probability that a transaction having A also contains B = P(B | A) = support(A ∪ B) / support(A) = support-count(A ∪ B) / support-count(A).

Example (the slide's Venn diagram of "customer buys beer", "customer buys diaper", and "customer buys both" is omitted):

Trans-id | Items bought
100 | A, B, C
200 | A, C
300 | A, D
400 | B, E, F

- I = {A, B, C, D, E, F}; min-support = 50%; min-confidence = 50%.
- A → C [50%, 66.7%]: support(A → C) = 2/4, confidence(A → C) = 2/3.
- C → A [50%, 100%]: support(C → A) = 2/4, confidence(C → A) = 2/2.
- What about A → B? {A, C} → B, etc.?
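To make the definitions concrete, here is a minimal Python sketch (illustrative, not from the slides; names like rule_metrics are my own) that computes support and confidence over the toy database above:

transactions = [
    {"A", "B", "C"},   # 100
    {"A", "C"},        # 200
    {"A", "D"},        # 300
    {"B", "E", "F"},   # 400
]

def support_count(itemset, db):
    # Number of transactions containing every item of `itemset`.
    return sum(1 for t in db if itemset <= t)

def rule_metrics(antecedent, consequent, db):
    both = support_count(antecedent | consequent, db)
    support = both / len(db)
    confidence = both / support_count(antecedent, db)
    return support, confidence

print(rule_metrics({"A"}, {"C"}, transactions))  # (0.5, 0.666...) -> A -> C [50%, 66.7%]
print(rule_metrics({"C"}, {"A"}, transactions))  # (0.5, 1.0)      -> C -> A [50%, 100%]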

Types of association rules
- Single- vs. multi-dimensional rules: Buys(X, {bread, egg}) → Buys(X, milk) is single-dimensional; Buys(X, beer) → Age(X, over-20) is multi-dimensional.
- Boolean vs. quantitative rules: Buys-bread-and-egg → Buys-milk is Boolean; Age(X, 1..15) → Income(X, 0..2K) is quantitative.
- Single- vs. multi-level rules: Buys(X, bread) → Buys(X, milk) vs. Buys(X, bread) → Buys(X, dairy-product) (given that milk is a kind of dairy product).
- Set shorthand: {Bread, Egg} → {Milk}.

Outline
- Association Rule Mining: basic concepts
- Association Rule Mining algorithms: single-dimensional Boolean associations, multi-level associations, multi-dimensional associations
- Association vs. correlation
- Adding constraints
- Applications/extensions of frequent pattern mining
- Summary

Mining single-dimensional Boolean rules
- Input: a database of transactions, where each transaction is a list of items.
- Output: rules that associate the presence of one itemset with another (i.e., the LHS and RHS of a rule), with no restriction on the number of items in either set.

Example (the same transaction database as before):

Trans-id | Items bought
100 | A, B, C
200 | A, C
300 | A, D
400 | B, E, F

- I = {A, B, C, D, E, F}; min-support = 50%; min-confidence = 50%.
- A → C [50%, 66.7%]: support = 2/4, confidence = 2/3.
- C → A [50%, 100%]: support = 2/4, confidence = 2/2.
- What about A → B? {A, C} → B, etc.?

Frequent pattern
- A pattern (itemset, sequence, etc.) that occurs frequently in a database [AIS93], i.e., one whose fraction of occurrences is at least a minimum support.

Transaction-id | Items bought
100 | A, B, C
200 | A, C
300 | A, D
400 | B, E, F

With min. support 50%:

Frequent pattern | Sup. count | Support
{A} | 3 | 75%
{B} | 2 | 50%
{C} | 2 | 50%
{A, C} | 2 | 50%

Note: {A, B}, {B, C}, {B, E}, etc. are omitted because they are infrequent.

What we really need for mining
- Goal: rules with high support/confidence.
- How to compute? Support: find sets of items that occur frequently. Confidence: find the frequency of subsets of supported itemsets.
- If we have all frequently occurring sets of items (frequent itemsets), we can compute support and confidence! How can we compute frequent itemsets efficiently?

The Apriori Algorithm
- Approach: level-wise generate-and-test.
- Basic steps: iterate with increasing size of itemsets; generate candidates (possible itemsets); test the candidates for large itemsets (those that satisfy min support, i.e., frequent itemsets); use the frequent itemsets to generate association rules.

Apriori pruning principle
- Any subset of a frequent itemset must be frequent; equivalently, any superset of an infrequent itemset is infrequent. Why?
- E.g., #{transactions containing A, B, and C} ≤ #{transactions containing A and B}, so: {A, B, C} is frequent → {A, B} must be frequent; {A, B} is infrequent → {A, B, C} is infrequent.

The Apriori Algorithm (cont.)
- Method: generate length-(k+1) candidate itemsets Ck+1 from the length-k frequent itemsets Lk, and test the candidates to obtain Lk+1.
- Assume itemsets are kept in lexicographical order.
- Performance studies show its efficiency and scalability [Agrawal & Srikant 1994; Mannila et al. 1994].

The Apriori Algorithm, an example (min. support 50%):

Database TDB:
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

1st scan: count C1 = {A}:2, {B}:3, {C}:3, {D}:1, {E}:3; keep L1 = {A}:2, {B}:3, {C}:3, {E}:3.
Generate C2 = {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}.
2nd scan: count C2 = {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2; keep L2 = {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2.
Generate C3 = {B,C,E}. (Note: {A,B,C}, {A,B,E}, {A,C,E} are not in C3.)
3rd scan: L3 = {B,C,E}:2; C4 is empty.

With min. support 50% and min. confidence 100%, the strong rules turn out to be A → C, B → E, E → B, BC → E, and CE → B (derived on the following slides).

Generating rules
- For each frequent itemset l, for each non-empty proper subset S of l: output the rule S → (l − S) if its confidence exceeds the threshold.
- E.g., consider L = {B, BCE} (writing AB for {A, B}). Assuming the min thresholds are satisfied, the frequent itemset BCE yields: B → CE, C → BE, E → BC, BC → E, BE → C, CE → B.
- No rule is generated from the frequent itemset B, since B has no non-empty proper subset.

(The worked Apriori example from the previous slide is repeated here.)
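The rule-generation step can be sketched as follows (a hedged illustration, not the course's code: generate_rules is a hypothetical helper, and the support counts are taken from the worked example):

from itertools import combinations

def generate_rules(itemset, support_counts, min_conf, n_transactions):
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):                      # non-empty proper subsets
        for antecedent in map(frozenset, combinations(itemset, r)):
            consequent = itemset - antecedent
            conf = support_counts[itemset] / support_counts[antecedent]
            if conf >= min_conf:
                sup = support_counts[itemset] / n_transactions
                rules.append((set(antecedent), set(consequent), sup, conf))
    return rules

# Support counts from the running example (TDB has 4 transactions):
counts = {frozenset(s): c for s, c in [
    ({"B"}, 3), ({"C"}, 3), ({"E"}, 3),
    ({"B", "C"}, 2), ({"B", "E"}, 3), ({"C", "E"}, 2), ({"B", "C", "E"}, 2),
]}
for a, c, s, conf in generate_rules({"B", "C", "E"}, counts, 1.0, 4):
    print(a, "->", c, f"[support {s:.0%}, confidence {conf:.0%}]")
# With min-confidence 100%, only BC -> E and CE -> B survive.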

The Apriori Algorithm, an example (cont.). Rules derived from L1 = {A}:2, {B}:3, {C}:3, {E}:3, L2 = {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2, and L3 = {B,C,E}:2, with min-support 50% and min-confidence 100%:

From L2:
A → C [2/4 = 50%, 2/2 = 100%]    C → A [2/4 = 50%, 2/3 ≈ 67%]
B → C [2/4 = 50%, 2/3 ≈ 67%]     C → B [2/4 = 50%, 2/3 ≈ 67%]
B → E [3/4 = 75%, 3/3 = 100%]    E → B [3/4 = 75%, 3/3 = 100%]
C → E [2/4 = 50%, 2/3 ≈ 67%]     E → C [2/4 = 50%, 2/3 ≈ 67%]

From L3:
B → CE [2/4 = 50%, 2/3 ≈ 67%]    C → BE [2/4 = 50%, 2/3 ≈ 67%]
E → BC [2/4 = 50%, 2/3 ≈ 67%]    BC → E [2/4 = 50%, 2/2 = 100%]
CE → B [2/4 = 50%, 2/2 = 100%]   BE → C [2/4 = 50%, 2/3 ≈ 67%]

Keeping only the strong rules (min-support 50%, min-confidence 100%):
A → C [50%, 100%]
B → E [75%, 100%]
E → B [75%, 100%]
BC → E [50%, 100%]
CE → B [50%, 100%]

The Apriori Algorithm
Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):

L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;

What is the maximum number of scans of the DB?

Important details
- How to generate candidates?
- How to count supports of candidates?
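For concreteness, a compact runnable sketch of this level-wise loop (an illustrative Python rendering, not the course's code). Candidate generation here simply enumerates (k+1)-subsets of the surviving frequent items and applies the Apriori pruning directly, which yields the same candidate set as the self-join formulation detailed on the next slides:

from collections import defaultdict
from itertools import combinations

def apriori(transactions, min_sup_count):
    transactions = [frozenset(t) for t in transactions]
    counts = defaultdict(int)
    for t in transactions:                      # 1st scan: count single items
        for item in t:
            counts[frozenset([item])] += 1
    L = {s: c for s, c in counts.items() if c >= min_sup_count}
    all_frequent = dict(L)
    k = 1
    while L:
        items = sorted({i for s in L for i in s})
        # C(k+1): the (k+1)-sets all of whose k-subsets are frequent
        candidates = [frozenset(c) for c in combinations(items, k + 1)
                      if all(frozenset(s) in L for s in combinations(c, k))]
        counts = defaultdict(int)
        for t in transactions:                  # one full DB scan per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        L = {s: n for s, n in counts.items() if n >= min_sup_count}
        all_frequent.update(L)
        k += 1
    return all_frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for s, n in sorted(apriori(tdb, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(s), n)   # reproduces L1, L2, and L3 of the worked example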

Candidate generation: obtaining Ck+1 from Lk
- Self-joining Lk. Example: L3 = {ABC, ABD, ACD, ACE, BCD}. Self-joining L3 with L3: ABC and ABD → ABCD; ACD and ACE → ACDE; no other pairs join, so the potential C4 = {ABCD, ACDE}.
- Pruning. Example: remove ACDE since CDE is not in L3, leaving C4 = {ABCD}.

Candidate generation (cont.)
- Assume the items in Lk are listed in an order.
- Step 1: self-join Lk (requires sharing the first k−1 items):
insert into Ck+1
select p.item1, p.item2, ..., p.itemk, q.itemk
from Lk p, Lk q
where p.item1 = q.item1, ..., p.item(k−1) = q.item(k−1), p.itemk < q.itemk
- Step 2: pruning:
for all itemsets c in Ck+1 do
    for all k-subsets s of c do
        if (s is not in Lk) then delete c from Ck+1
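The same generation step written as the slides' self-join plus prune (again an illustrative sketch; it reproduces the L3 = {ABC, ABD, ACD, ACE, BCD} example above):

from itertools import combinations

def apriori_gen(L_k):
    """Self-join Lk on the first k-1 items, then prune candidates
    that contain an infrequent k-subset."""
    L_k = sorted(sorted(s) for s in L_k)       # lexicographic order
    k = len(L_k[0])
    frequent = {tuple(s) for s in L_k}
    candidates = []
    for i, p in enumerate(L_k):
        for q in L_k[i + 1:]:
            if p[:k - 1] == q[:k - 1]:         # join: share the first k-1 items
                c = p + [q[-1]]
                # prune: every k-subset of c must be in Lk
                if all(s in frequent for s in combinations(c, k)):
                    candidates.append(c)
    return candidates

L3 = [("A","B","C"), ("A","B","D"), ("A","C","D"), ("A","C","E"), ("B","C","D")]
print(apriori_gen(L3))   # [['A', 'B', 'C', 'D']]; ACDE is pruned (CDE is not in L3)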

(The worked Apriori example is repeated here, traced below through candidate generation.)

Tracing candidate generation on the example; no pruning is triggered in any step:
- Self-joining L1 requires sharing of the first 0 items, i.e., each candidate is appended from two elements, one from each joined entry: L1 × L1 gives C2 = {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}.
- Self-joining L2 = {A,C}, {B,C}, {B,E}, {C,E} requires sharing of the first 1 item: only the middle two sets ({B,C} and {B,E}) join, giving C3 = {B,C,E}. (Note: {A,B,C}, {A,B,E}, {A,C,E} are not in C3.)
- Self-joining L3 = {B,C,E} requires sharing of the first 2 items: none is possible since L3 has one entry, so C4 is empty.

Counting supports of candidates
- Why is counting the supports of candidates a problem? The total number of candidates can be very large, and one transaction may contain many candidates.
- Method: candidate itemsets are stored in a hash tree. A leaf node of the hash tree contains a list of itemsets and counts; an interior node contains a hash table. A subset function provides the hashing addresses.
- Example: given a hash tree representing the candidate itemsets found so far and a transaction, find all the candidates contained in the transaction (and increment their counts).

Example: the subset function hashes item ids into three branches: 1,4,7 / 2,5,8 / 3,6,9. The leaves hold the candidate 3-itemsets found so far: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {2 3 4}, {5 6 7}, {1 3 6}, {1 5 9}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}. Transaction: 1 2 3 5 6. The goal is to scan through the transaction and update the support counts of the candidate itemsets it contains.

Walking the hash tree for transaction 1 2 3 5 6: at the root, branch on each item that can start a candidate: 1 (covering 1 + 2 3 5 6), 2 (2 + 3 5 6), and 3 (3 + 5 6); within the 1-branch, branch again on 2 (1 2 + 3 5 6), 3 (1 3 + 5 6), and 5 (1 5 + 6); continue descending until leaves are reached, then check each stored itemset against the transaction. Here the matching candidates are {1 2 5}, {1 3 6}, and {3 5 6}.
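A full hash-tree implementation is longer than these slides need, but the effect can be sketched with a flat hash table standing in for the tree (an illustrative simplification, not the slides' structure): each k-subset of a transaction is hashed and matched against the stored candidates.

from itertools import combinations

def count_supports(transactions, candidates):
    # A dict keyed by frozenset plays the role of the hash tree.
    k = len(next(iter(candidates)))
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:
                counts[key] += 1
    return counts

leaves = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (2,3,4), (5,6,7),
          (1,3,6), (1,5,9), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]
hits = count_supports([{1, 2, 3, 5, 6}], leaves)
print({tuple(sorted(s)): c for s, c in hits.items() if c})
# {(1, 2, 5): 1, (1, 3, 6): 1, (3, 5, 6): 1}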

Apriori in SQL
- S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: alternatives and implications. In SIGMOD '98.
- It is hard to get good performance out of pure (SQL-92) approaches alone.
- Making use of object-relational extensions yields orders-of-magnitude improvement.

Efficiency issues. Challenges:
- Multiple scans of a large transaction database.
- A huge number of candidates.
- The tedious workload of support counting for candidates.

Improving the efficiency of Apriori. General ideas:
- Reduce the number of candidates.
- Reduce the number of database scans/transactions.
- Facilitate the support counting of candidates.

Reducing candidates
- A hash-based technique to reduce Ck. Idea: a k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent.
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD '95.
- Example (k = 2): items A, B, C, D, E are ordered as 1, 2, 3, 4, 5, respectively. For a 2-itemset {X, Y}, define the hash function h(X, Y) = (o(X) * 10 + o(Y)) mod 7, where o(X) is the order of X.

An example. During the 1st scan of TDB (which determines L1 from C1), also hash every 2-itemset of each transaction into a table, and omit any 2-itemset in a bucket whose count is below the threshold 2.

Database TDB:
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

With items A, B, C, D, E ordered as 1..5 and h(X, Y) = (o(X) * 10 + o(Y)) mod 7:
h(A,B) = 12 mod 7 = 5; h(A,C) = 13 mod 7 = 6; h(A,D) = 14 mod 7 = 0; h(A,E) = 15 mod 7 = 1;
h(B,C) = 23 mod 7 = 2; h(B,E) = 25 mod 7 = 4; h(C,D) = 34 mod 7 = 6; h(C,E) = 35 mod 7 = 0.

Bucket | Content | Count
0 | AD, CE, CE | 3
1 | AE | 1
2 | BC, BC | 2
3 | (empty) | 0
4 | BE, BE, BE | 3
5 | AB | 1
6 | AC, CD, AC | 3

Buckets 1 and 5 fall below the threshold, so {A,B} and {A,E} can be dropped from C2 = {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E} before the 2nd scan.

(The worked Apriori example is repeated here for comparison.)
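The bucket table can be reproduced in a few lines of Python (a hedged sketch of the hash-based pruning idea; the variable names are mine):

from collections import defaultdict
from itertools import combinations

order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

def h(x, y):
    # The slide's hash function: h(X, Y) = (o(X)*10 + o(Y)) mod 7
    return (order[x] * 10 + order[y]) % 7

tdb = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
buckets = defaultdict(int)
for t in tdb:
    for x, y in combinations(sorted(t), 2):
        buckets[h(x, y)] += 1

threshold = 2
c2 = [("A","B"), ("A","C"), ("A","E"), ("B","C"), ("B","E"), ("C","E")]
pruned_c2 = [p for p in c2 if buckets[h(*p)] >= threshold]
print(dict(buckets))   # bucket counts matching the table above
print(pruned_c2)       # [('A','C'), ('B','C'), ('B','E'), ('C','E')]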

Exploring the vertical data format
- Use tid-lists: for each itemset, the list of ids of the transactions containing it.
- Compression of tid-lists. E.g.: A: t1, t2, t3 (sup(A) = 3); B: t2, t3, t4 (sup(B) = 3); AB: t2, t3 (sup(AB) = 2).
- The major operation is the intersection of tid-lists (a small sketch follows after the outline below).
- M. Zaki et al. New algorithms for fast discovery of association rules. In KDD '97.
- P. Shenoy et al. Turbo-charging vertical mining of large databases. In SIGMOD '00.

Improving the efficiency of Apriori. General ideas:
- Reduce the number of candidates.
- Reduce the number of database scans/transactions.
- Facilitate the support counting of candidates.
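The vertical-format operation described above amounts to set intersection (a minimal sketch of the slide's A/B example; tid-list compression is omitted):

tidlists = {"A": {1, 2, 3}, "B": {2, 3, 4}}

def join(tids_x, tids_y):
    # The support of the combined itemset is the size of the intersection.
    return tids_x & tids_y

ab = join(tidlists["A"], tidlists["B"])
print(sorted(ab), "sup =", len(ab))   # [2, 3] sup = 2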

DIC: reduce the number of scans
- The slide shows the itemset lattice over {A, B, C, D}: {} at the bottom, then A, B, C, D; then AB, AC, BC, AD, BD, CD; then ABC, ABD, ACD, BCD; then ABCD.
- Apriori: once both A and D are determined frequent, the counting of AD begins; once all length-2 subsets of BCD are determined frequent, the counting of BCD begins.
- DIC is dynamic: the counting of new candidates can start at any point during a scan rather than only at scan boundaries, so 1-itemset, 2-itemset, and 3-itemset candidates are counted concurrently as the transactions stream by.
- S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD '97.

Partition: scan the database only twice
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association in large databases. In VLDB '95.
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Scan 1: partition the database and find the local frequent patterns. Scan 2: consolidate the global frequent patterns. (A small sketch follows.)
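The two-scan Partition idea can be sketched as follows (illustrative only: the local miner below is deliberately brute-force, where the real algorithm runs an in-memory miner per partition):

from itertools import combinations

def local_frequent(db, min_sup_ratio):
    # Brute force for brevity: all itemsets meeting the ratio locally.
    items = sorted({i for t in db for i in t})
    out = set()
    for k in range(1, len(items) + 1):
        for c in map(frozenset, combinations(items, k)):
            if sum(c <= t for t in db) >= min_sup_ratio * len(db):
                out.add(c)
    return out

tdb = [frozenset(t) for t in
       [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]]
min_sup = 0.5
parts = [tdb[:2], tdb[2:]]
# Scan 1: the union of local frequent itemsets is a superset of the global ones
candidates = set().union(*(local_frequent(p, min_sup) for p in parts))
# Scan 2: count the candidates globally and keep the truly frequent ones
globally_frequent = {c for c in candidates
                     if sum(c <= t for t in tdb) >= min_sup * len(tdb)}
print(sorted(tuple(sorted(c)) for c in globally_frequent))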

Sampling for frequent patterns
- H. Toivonen. Sampling large databases for association rules. In VLDB '96.
- Select a sample of the original database and mine frequent patterns within the sample using Apriori.
- Scan the database once to verify the frequent itemsets found in the sample.
- Scan the database again to find any missed frequent patterns.

Improving the efficiency of Apriori. General ideas:
- Reduce the number of candidates.
- Reduce the number of database scans/transactions.
- Facilitate the support counting of candidates.

Bottleneck of Apriori
- Sources of computational cost: multiple database scans; long patterns mean many scans and a huge number of candidates. To find the frequent itemset i1 i2 ... i100: number of scans: 100; number of candidates: 2^100 − 1 ≈ 1.27 × 10^30!
- The bottleneck is candidate generation-and-test. Can we avoid candidate generation?

Frequent Pattern Growth (FP-Growth)
- A method of mining frequent itemsets without candidate generation.
- Ideas: use a compact data structure, the FP-tree, to represent the frequent itemsets of the transactions; grow long patterns from short ones using local frequent items.

Construct FP-tree

TID | Items bought | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
300 | {b, f, h, j, o, w} | {f, b}
400 | {b, c, k, s, p} | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

With min_support = 3 the item counts are f:4, c:4, a:3, b:3, m:3, p:3, so F-list = f-c-a-b-m-p.

1. Scan the DB once and find the frequent 1-itemsets (single-item patterns).
2. Sort the frequent items in frequency-descending order, giving the F-list.
3. Scan the DB again and construct the FP-tree by inserting each transaction's ordered frequent items, sharing common prefixes (the slide shows the node counts growing as each transaction is inserted).

The finished tree, rooted at {}: a path f:4 → c:3 → a:3, where a:3 branches into m:2 → p:2 and into b:1 → m:1; f:4 also has a child b:1; and the root has a second child c:1 → b:1 → p:1.
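A hedged sketch of the two-scan construction (illustrative Python; Node and build_fptree are my own names, and ties in item frequency may order differently than the slide's F-list):

from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(transactions, min_sup):
    # Pass 1: frequent items in frequency-descending order (the F-list)
    freq = defaultdict(int)
    for t in transactions:
        for i in t:
            freq[i] += 1
    flist = [i for i, c in sorted(freq.items(), key=lambda x: -x[1]) if c >= min_sup]
    rank = {i: r for r, i in enumerate(flist)}
    # Pass 2: insert each transaction's ordered frequent items, sharing prefixes
    root = Node(None, None)
    header = defaultdict(list)           # item -> its nodes (the node-links)
    for t in transactions:
        node = root
        for i in sorted((i for i in t if i in rank), key=rank.get):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += 1
    return root, header, flist

db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
root, header, flist = build_fptree(db, 3)
print(flist)   # the six frequent items f, c, a, b, m, p (tie order may vary)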

Finding conditional pattern bases
- For each frequent item X, start at X's entry in the header table and follow the node-links to traverse the FP-tree.
- Accumulate all transformed prefix paths of item X to form X's conditional pattern base (each path carrying the count of X on it).

With min_support = 3 and header-table counts f:4, c:4, a:3, b:3, m:3, p:3:

Item | Conditional pattern base
c | f:3
a | fc:3
b | fca:1, f:1, c:1
m | fca:2, fcab:1
p | fcam:2, cb:1

Finding the conditional FP-tree
- For each pattern base: accumulate the count for each item in the base, then construct the FP-tree over the frequent items of the pattern base.
- m's pattern base is fca:2, fcab:1, so the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3 (b, with count 1, is dropped as infrequent). The m-frequent patterns are fm:3, cm:3, am:3, fcm:3, fam:3, cam:3, fcam:3.

Finding the conditional FP-tree (cont.)
- c's pattern base is f:3, so the c-conditional FP-tree is {} → f:3, giving the c-frequent pattern fc:3.
- a's pattern base is fc:3, so the a-conditional FP-tree is {} → f:3 → c:3, giving the a-frequent patterns fa:3, ca:3, fca:3.
- b's pattern base is fca:1, f:1, c:1; the accumulated counts are f:2, c:2, a:1, so no item reaches the min. support of 3, the b-conditional FP-tree is empty, and b yields no frequent patterns (beyond b itself).

Finding the conditional FP-tree (cont.)
- p's pattern base is fcam:2, cb:1; only c reaches support 3 (c: 2 + 1 = 3; f:2, a:2, m:2, b:1), so the p-conditional FP-tree is {} → c:3, giving the p-frequent pattern cp:3.

Frequent patterns from the FP-tree (same transaction database; min. support = 3):

Item | Cond. pattern base | Frequent patterns
c | f:3 | fc:3
a | fc:3 | fa:3, ca:3, fca:3
b | fca:1, f:1, c:1 | none
m | fca:2, fcab:1 | fm:3, cm:3, am:3, fcm:3, fam:3, cam:3, fcam:3
p | fcam:2, cb:1 | cp:3

Do these give the same results as Apriori?
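The whole recursion can be sketched compactly if each conditional database is kept as a list of (prefix-path, count) pairs instead of a physical tree, which is equivalent for the recursion the slides describe (an illustrative sketch, not the book's algorithm; ties in the F-list may order differently from the slides, but the resulting patterns are the same):

from collections import defaultdict

def fpgrowth(transactions, min_sup, suffix=frozenset()):
    """Mine frequent itemsets; `transactions` is a list of (itemset, count) pairs."""
    freq = defaultdict(int)
    for items, cnt in transactions:
        for i in items:
            freq[i] += cnt
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    flist = sorted(freq, key=lambda i: (-freq[i], i))   # frequency-descending
    rank = {i: r for r, i in enumerate(flist)}
    patterns = {}
    for item in reversed(flist):                        # least frequent first
        new_suffix = suffix | {item}
        patterns[new_suffix] = freq[item]
        # Conditional pattern base: the prefix paths above `item`
        cond = [(frozenset(i for i in items if i in rank and rank[i] < rank[item]), c)
                for items, c in transactions if item in items]
        patterns.update(fpgrowth(cond, min_sup, new_suffix))
    return patterns

db = [(frozenset(t), 1) for t in
      [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]]
for s, c in sorted(fpgrowth(db, 3).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print("".join(sorted(s)), c)   # reproduces the table above, plus the 1-itemsets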

Mining FP-trees
- Idea: frequent pattern growth. Recursively grow frequent patterns, pattern by pattern and database partition by partition.
- Method (sketched; see the FP-growth algorithm in the text): for each frequent pattern, construct its conditional pattern base and then its conditional FP-tree; repeat on each newly created conditional FP-tree until the resulting FP-tree is empty or contains only a single path. A single path generates all the combinations of its sub-paths, each of which is a frequent pattern.

Recursion on conditional FP-trees
- m-conditional FP-tree: {} → f:3 → c:3 → a:3.
- Conditional pattern base of am: (fc:3); am-conditional FP-tree: {} → f:3 → c:3.
- Conditional pattern base of cm: (f:3); cm-conditional FP-tree: {} → f:3.
- Conditional pattern base of cam: (f:3); cam-conditional FP-tree: {} → f:3.

Building block
- Suppose a (conditional) FP-tree T has a shared single prefix path. Mining can then be decomposed into two parts: reduction of the single prefix path into one node, and concatenation of the mining results of the two parts.
- The slide's figure: a tree whose root path a1:n1 → a2:n2 → a3:n3 branches into subtrees b1:m1 and c1:k1 (with c2:k2 and c3:k3 below) is split into the single path r1 = a1:n1 → a2:n2 → a3:n3 and the branching remainder rooted at r1.

Benefits of the FP-tree structure
- Complete: retains complete information for frequent patterns; never breaks a long pattern of any transaction.
- Compact and non-redundant: infrequent items are gone, which removes irrelevant information; items appear in frequency-descending order, so the more frequently an item occurs, the more likely its nodes are shared; the tree is never larger than the original database (not counting node-links and count fields).
- Frequent patterns can be partitioned into subsets according to the F-list. E.g., for F-list = f-c-a-b-m-p: patterns containing p; patterns having m but no p; ...; patterns having c but none of a, b, m, p; etc.

Scaling FP-growth
- What if the FP-tree cannot fit in memory? → DB projection: first partition the database into a set of projected DBs, then construct and mine an FP-tree for each projected DB.
- Parallel projection vs. partition projection: parallel projection is space-costly.

FP-Growth vs. Apriori (figure): run time (sec.) vs. support threshold (%) on data set T25I20D10K, comparing the D1 FP-growth runtime with the D1 Apriori runtime over thresholds from 0 to 3%. Apriori's runtime climbs steeply as the support threshold drops, while FP-growth's stays low.

Why is FP-Growth the winner?
- Divide-and-conquer: it decomposes both the mining task and the DB according to the frequent patterns obtained so far, which leads to a focused search of smaller databases.
- Other factors: no candidate generation and no candidate test; a compressed database (the FP-tree structure); no repeated scan of the entire database; the basic operations are counting local frequent items and building sub-FP-trees, with no pattern search and matching.

Outline
- Association Rule Mining: basic concepts
- Association Rule Mining algorithms: single-dimensional Boolean associations, multi-level associations, multi-dimensional associations
- Association vs. correlation
- Adding constraints
- Applications/extensions of frequent pattern mining
- Summary