Chapters 6 & 7, Frequent Pattern Mining


CSI 4352, Introduction to Data Mining
Chapters 6 & 7, Frequent Pattern Mining
Young-Rae Cho, Associate Professor, Department of Computer Science, Baylor University

Lecture Outline
- Market Basket Problem
- Apriori Algorithm
- CHARM Algorithm
- Advanced Frequent Pattern Mining
- Advanced Association Rule Mining
- Constraint-Based Association Mining

Market Basket Problem

Example: Customers who bought beer also bought diapers.
Motivation: To promote sales in retail by cross-selling.
Required Data: Customers' purchase patterns (items often purchased together).
Applications: Store arrangement, catalog design, discount plans.

Solving the Market Basket Problem

Basic Terms
- Transaction: a set of items (bought by one person at one time)
- Frequent itemset: a set of items (a subset of a transaction) that occurs frequently across transactions
- Association rule: a one-directional relationship between two sets of items, e.g., A → B

Process
- Step 1: Generation of frequent itemsets, e.g., { beer, nuts, diapers }
- Step 2: Generation of association rules, e.g., { beer } → { nuts, diapers } (the expected output knowledge)

Frequent Itemsets

Transaction Table
T-ID  Items
1     bread, eggs, milk, diapers
2     coke, beer, nuts, diapers
3     eggs, juice, beer, nuts
4     milk, beer, nuts, diapers
5     milk, beer, diapers

Support: frequency of a set of items across transactions
- { milk, diapers }, { beer, nuts }, { beer, diapers }: 60% support
- { milk, beer, diapers }, { beer, nuts, diapers }: 40% support

Frequent itemsets: itemsets having support greater than (or equal to) a user-specified minimum support.

Association Rules

Frequent itemsets (min. support = 60%, size ≥ 2): { milk, diapers }, { beer, nuts }, { beer, diapers }

Confidence: for A → B, the percentage of transactions containing A that also contain B
- { milk } → { diapers }, { nuts } → { beer }: 100% confidence
- { diapers } → { milk }, { beer } → { nuts }, { beer } → { diapers }, { diapers } → { beer }: 75% confidence

Association rules: rules having confidence greater than (or equal to) a user-specified minimum confidence.
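These support and confidence figures can be checked with a few lines of Python; a minimal sketch over the transaction table above (function names are illustrative, not from the slides):

```python
# Sketch: support and confidence over the transaction table above.
transactions = [
    {"bread", "eggs", "milk", "diapers"},
    {"coke", "beer", "nuts", "diapers"},
    {"eggs", "juice", "beer", "nuts"},
    {"milk", "beer", "nuts", "diapers"},
    {"milk", "beer", "diapers"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction also containing `consequent`."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"milk", "diapers"}))        # 0.6
print(confidence({"milk"}, {"diapers"}))   # 1.0
print(confidence({"beer"}, {"nuts"}))      # 0.75
```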

Generalized Formulas

Association Rules
- I = { I1, I2, ..., Im }: the set of all items; T = { T1, T2, ..., Tn }: the set of transactions, where Tk ⊆ I for all k
- A rule has the form A → B, where A ⊆ I (A ≠ ∅), B ⊆ I (B ≠ ∅), A ⊆ Ti for some i, B ⊆ Tj for some j, and A ∩ B = ∅

Computation of Support
- support(A → B) = P(A ∪ B), where P(X) = |{ Ti : X ⊆ Ti }| / n

Computation of Confidence
- confidence(A → B) = P(B | A) = P(A ∪ B) / P(A)

Problem of Support & Confidence

Support Table
             Tea   Not Tea   SUM
Coffee        20      50      70
Not Coffee    10      20      30
SUM           30      70     100

Association rule: { Tea } → { Coffee }
- Support({ Tea } → { Coffee })?
- Confidence({ Tea } → { Coffee })?
- Problems in this dataset?
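Worked answer (not spelled out on the slide, but it follows directly from the table): support({Tea} → {Coffee}) = 20/100 = 20%, and confidence({Tea} → {Coffee}) = 20/30 ≈ 66.7%. Both look acceptable, yet P(Coffee) = 70%: a customer who buys tea is actually less likely to buy coffee than an average customer. Support and confidence alone can therefore make a negatively correlated rule look interesting; this motivates the alternative measures below.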

Alternative Measures

Lift
- lift(A → B) = confidence(A → B) / P(B)
- The association rule A → B is interesting if lift(A → B) > 1
- Note that lift is identical to the correlation between A and B:

Correlation
- lift(A → B) = P(A ∪ B) / ( P(A) P(B) ) = correlation(A, B)
- Positive correlation if correlation(A, B) > 1
- Negative correlation if correlation(A, B) < 1

Alternative Measures (cont'd)

χ² Test (Chi-Square Test)
- Evaluates whether an observed distribution in a sample differs from a theoretical distribution (i.e., a hypothesis)
- Where Ei is an expected frequency and Oi is an observed frequency:
  χ² = Σ i=1..n (Oi − Ei)² / Ei
- The larger χ², the more likely the variables are related (positively or negatively)

Coverage
- coverage(A → B) = P(A)
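Both measures are easy to verify on the tea/coffee table above; a small sketch in plain Python (no libraries; variable names are mine):

```python
# Sketch: lift and chi-square for the 2x2 tea/coffee contingency table above.
n = 100
n_tea, n_coffee, n_both = 30, 70, 20

p_tea, p_coffee = n_tea / n, n_coffee / n
p_both = n_both / n

lift = p_both / (p_tea * p_coffee)   # 0.2 / (0.3 * 0.7) ~ 0.952
print(f"lift = {lift:.3f}")          # < 1 => negative correlation

# Chi-square: compare observed cell counts with counts expected under independence.
observed = [20, 50, 10, 20]          # (tea,coffee), (~tea,coffee), (tea,~coffee), (~tea,~coffee)
expected = [n * p_tea * p_coffee,            # 21
            n * (1 - p_tea) * p_coffee,      # 49
            n * p_tea * (1 - p_coffee),      # 9
            n * (1 - p_tea) * (1 - p_coffee)]# 21
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(f"chi2 = {chi2:.3f}")
```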

CSI 4352, Introduction to Data Mining
Chapters 6 & 7, Frequent Pattern Mining
Outline: Market Basket Problem, Apriori Algorithm, CHARM Algorithm, Advanced Frequent Pattern Mining, Advanced Association Rule Mining, Constraint-Based Association Mining

Frequent Itemset Mining

Process
(1) Find frequent itemsets  ← the computational problem
(2) Find association rules

Brute-Force Algorithm for Frequent Itemset Generation
- Enumerate all possible subsets of the total itemset I
- Count the frequency of each subset
- Select the frequent itemsets

Problem
- Enumerating all candidates is not computationally feasible
- An efficient and scalable algorithm is required

Apriori Algorithm

Motivations
- Efficient frequent itemset analysis
- A scalable approach

Process: iterative increment of the itemset size
(1) Candidate itemset generation  ← the computational problem
(2) Frequent itemset selection

Downward Closure Property
- Any superset of an itemset X cannot have higher support than X
- Equivalently, if an itemset X is frequent (its support is at least the minimum support), then every subset of X is also frequent

Candidate Itemset Generation

Process: two steps, (1) selective joining and (2) a priori pruning

Selective Joining
- Each candidate itemset of size k is generated by joining two frequent itemsets of size (k-1)
- Two frequent itemsets of size (k-1) are joined when they share a frequent sub-itemset of size (k-2)

A Priori Pruning
- A candidate itemset of size k that has any infrequent sub-itemset of size (k-1) is pruned

Detail of Apriori Algorithm

Basic Terms
- Ck: candidate itemsets of size k
- Lk: frequent itemsets of size k
- sup_min: minimum support

Pseudo Code
  k ← 1
  Lk ← frequent itemsets of size 1
  while Lk ≠ ∅ do
      k ← k + 1
      Ck ← candidate itemsets generated by selective joining and a priori pruning from L(k-1)
      Lk ← frequent itemsets selected from Ck using sup_min
  end while
  return ∪k Lk

Example of Apriori Algorithm (sup_min = 2)

T-ID  Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

C1: {A} 2, {B} 3, {C} 3, {D} 1, {E} 3
L1: {A} 2, {B} 3, {C} 3, {E} 3
C2: {A,B} 1, {A,C} 2, {A,E} 1, {B,C} 2, {B,E} 3, {C,E} 2
L2: {A,C} 2, {B,C} 2, {B,E} 3, {C,E} 2
C3: {B,C,E} 2
L3: {B,C,E} 2
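A compact, runnable sketch of the whole loop (selective joining, a priori pruning, support counting) on the example database; helper names are mine, not from the slides:

```python
# Sketch: Apriori with selective joining and a priori pruning.
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support_count} for all frequent itemsets."""
    def count(c):
        return sum(c <= t for t in transactions)
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    L = {frozenset([i]) for i in items if count(frozenset([i])) >= min_sup}
    result = {s: count(s) for s in L}
    k = 2
    while L:
        # Selective joining: unions of two (k-1)-itemsets that share k-2 items
        C = {a | b for a in L for b in L if len(a | b) == k}
        # A priori pruning: every (k-1)-subset of a candidate must be frequent
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = {c for c in C if count(c) >= min_sup}
        result.update({s: count(s) for s in L})
        k += 1
    return result

db = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]
for s, n in sorted(apriori(db, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(s), n)   # ends with {'B','C','E'} 2, matching L3 above
```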

Summary of Apriori Algorithm

Features
- An iterative approach with a level-wise search
- Reduces the search space by the downward closure property

References
- Agrawal, R., Imielinski, T. and Swami, A., Mining Association Rules between Sets of Items in Large Databases, In Proceedings of ACM SIGMOD (1993)
- Agrawal, R. and Srikant, R., Fast Algorithms for Mining Association Rules, In Proceedings of VLDB (1994)

Challenges of Apriori Algorithm

Challenges
- Multiple scans of the transaction database
- A huge number of candidates
- A tedious workload of support counting

Solutions
- Reducing the number of transaction database scans
- Shrinking the number of candidates
- Facilitating support counting

CSI 4352, Introduction to Data Mining
Chapters 6 & 7, Frequent Pattern Mining
Outline: Market Basket Problem, Apriori Algorithm, CHARM Algorithm, Advanced Frequent Pattern Mining, Advanced Association Rule Mining, Constraint-Based Association Mining

Association Rule Mining

Process
(1) Find frequent itemsets  ← the computational problem
(2) Find association rules  ← redundant rule generation

Example 1
- { beer } → { nuts } (40% support, 75% confidence)
- { beer } → { nuts, diapers } (40% support, 75% confidence)
- The first rule is not meaningful: it adds no information beyond the second, more specific rule with the same support and confidence.

Example 2
- { beer } → { nuts } (60% support, 75% confidence)
- { beer, diapers } → { nuts } (40% support, 75% confidence)
- Both rules are meaningful.

Frequent Closed Itemsets

General Definition of Closure
- A frequent itemset X is closed if there exists no superset Y ⊃ X with the same support as X
- Different from frequent maximal itemsets (a maximal itemset has no frequent superset at all)

Frequent Closed Itemsets with Min. Support of 40% (size ≥ 2)
- { milk, diapers }  60%
- { beer, nuts }  60%
- { beer, diapers }  60%
- { milk, beer, diapers }  40%
- { beer, nuts, diapers }  40%
(Itemsets such as { milk, beer } and { nuts, diapers }, both at 40%, are not closed: each has the same support as one of the supersets above.)

T-ID  Items
1     bread, eggs, milk, diapers
2     coke, beer, nuts, diapers
3     eggs, juice, beer, nuts
4     milk, beer, nuts, diapers
5     milk, beer, diapers

Mapping of Items and Transactions

Mapping Functions
- I = { I1, I2, ..., Im }, T = { T1, T2, ..., Tn }, X ⊆ I, Y ⊆ T
- i(Y): the itemset contained in all transactions in Y
- t(X): the set of transactions (tidset) that contain all items in X

Properties
- X1 ⊆ X2 ⟹ t(X1) ⊇ t(X2)   e.g., {ACW} ⊆ {ACTW} ⟹ {1345} ⊇ {135}
- Y1 ⊆ Y2 ⟹ i(Y1) ⊇ i(Y2)   e.g., {245} ⊆ {2456} ⟹ {CDW} ⊇ {CD}
- X ⊆ i(t(X)), Y ⊆ t(i(Y))
  e.g., t({AC}) = {1345}, i({1345}) = {ACW}
  e.g., i({134}) = {ACW}, t({ACW}) = {1345}

T-ID  Items
1     A, C, T, W
2     C, D, W
3     A, C, T, W
4     A, C, D, W
5     A, C, D, T, W
6     C, D, T

Definition of Closure

Closure Operator
- c_it(X) = i( t(X) ), c_ti(Y) = t( i(Y) )

Formal Definition of Closure
- An itemset X is closed if X = c_it(X)
- A tidset Y is closed if Y = c_ti(Y)

[Diagram: round trip between the itemset space and the tidset space; X1 ⊂ i(t(X1)) means X1 is not closed, while X2 = i(t(X2)) means X2 is closed.]

Examples of Closed Itemsets
- X = {ACW}: support(X) = 67%, t(X) = {1345}, i(t(X)) = {ACW}  → X is closed
- X = {AC}:  support(X) = 67%, t(X) = {1345}, i(t(X)) = {ACW}  → X is not closed
- X = {ACT}: support(X) = 50%, t(X) = {135},  i(t(X)) = {ACTW} → X is not closed
- X = {CT}:  support(X) = 67%, t(X) = {1356}, i(t(X)) = {CT}   → X is closed

T-ID  Items
1     A, C, T, W
2     C, D, W
3     A, C, T, W
4     A, C, D, W
5     A, C, D, T, W
6     C, D, T
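These operators are easy to check in code; a minimal sketch on the six-transaction table above (function names mirror the slide's t and i):

```python
# Sketch: mapping functions t(), i() and the closure check on the table above.
db = {1: set("ACTW"), 2: set("CDW"), 3: set("ACTW"),
      4: set("ACDW"), 5: set("ACDTW"), 6: set("CDT")}

def t(X):
    """Tidset of all transactions containing every item in X."""
    return {tid for tid, items in db.items() if set(X) <= items}

def i(Y):
    """Itemset contained in every transaction of (non-empty) tidset Y."""
    return set.intersection(*(db[tid] for tid in Y))

def is_closed(X):
    return set(X) == i(t(X))

print(t("AC"), i(t("AC")), is_closed("AC"))   # {1,3,4,5} {'A','C','W'} False
print(t("ACW"), is_closed("ACW"))             # {1,3,4,5} True
print(t("CT"), is_closed("CT"))               # {1,3,5,6} True
```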

CHARM Algorithm

Motivations
- Efficient frequent closed itemset analysis
- Non-redundant rule generation

Properties
- Simultaneous exploration of the itemset space and the tidset space
- Does not enumerate all possible subsets of a closed itemset
- Early pruning strategy for infrequent and non-closed itemsets

Process, for each itemset pair
(1) Compute the frequency of their union
(2) Prune all infrequent and non-closed branches

Frequency Computation

Operation
- The tidset of the union of two itemsets X1 and X2 is the intersection of their tidsets:
  t(X1 ∪ X2) = t(X1) ∩ t(X2)

Example
- X1 = {AC}, X2 = {D}
- t(X1 ∪ X2) = t({ACD}) = {45}
- t(X1) ∩ t(X2) = {1345} ∩ {2456} = {45}

T-ID  Items
1     A, C, T, W
2     C, D, W
3     A, C, T, W
4     A, C, D, W
5     A, C, D, T, W
6     C, D, T

Pruning Strategy

Pruning: suppose a pair of itemsets X1 and X2 is examined.
(1) t(X1) = t(X2):  t(X1) ∩ t(X2) = t(X1) = t(X2)  → replace X1 with X1 ∪ X2, and prune X2
(2) t(X1) ⊂ t(X2):  t(X1) ∩ t(X2) = t(X1) ≠ t(X2)  → replace X1 with X1 ∪ X2, and keep X2
(3) t(X1) ⊃ t(X2):  t(X1) ∩ t(X2) = t(X2) ≠ t(X1)  → replace X2 with X1 ∪ X2, and keep X1
(4) otherwise:  t(X1) ∩ t(X2) ≠ t(X1) and t(X1) ∩ t(X2) ≠ t(X2)  → keep both X1 and X2

Example of CHARM Algorithm

[Diagram: the subset lattice over items A, C, D, T, W, each itemset annotated with its tidset, e.g., t({A}) = {1345}, t({C}) = {123456}, t({D}) = {2456}, t({T}) = {1356}, t({W}) = {12345}, t({AC}) = {1345}, t({CD}) = {2456}, t({CT}) = {1356}, t({CW}) = {12345}, t({ACW}) = {1345}, t({ACD}) = {45}, t({ACTW}) = {135}; explored with 50% minimum support.]

T-ID  Items
1     A, C, T, W
2     C, D, W
3     A, C, T, W
4     A, C, D, W
5     A, C, D, T, W
6     C, D, T
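A tiny sketch of how the four cases can be decided on tidsets (the example values follow the running table; the function name is mine):

```python
# Sketch: CHARM's four pruning cases, decided by comparing tidsets.
def charm_case(t1, t2):
    """Given tidsets t1 = t(X1) and t2 = t(X2), return the action CHARM takes."""
    if t1 == t2:
        return "replace X1 with X1 U X2, prune X2"    # case (1)
    if t1 < t2:                                       # t1 is a proper subset of t2
        return "replace X1 with X1 U X2, keep X2"     # case (2)
    if t1 > t2:
        return "replace X2 with X1 U X2, keep X1"     # case (3)
    return "keep both X1 and X2"                      # case (4)

# In the running example: t({A}) = {1,3,4,5}, t({W}) = {1,2,3,4,5}, t({D}) = {2,4,5,6}
print(charm_case({1, 3, 4, 5}, {1, 2, 3, 4, 5}))  # case (2): A always occurs with W
print(charm_case({1, 3, 4, 5}, {2, 4, 5, 6}))     # case (4): incomparable tidsets
```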

Summary of CHARM Algorithm

Advantages
- No need for multiple scans of the transaction database
- A revision and enhancement of the Apriori algorithm
- No loss of information

References
- Zaki, M.J., Generating Non-Redundant Association Rules, In Proceedings of ACM SIGKDD (2000)
- Zaki, M.J. and Hsiao, C.-J., CHARM: An Efficient Algorithm for Closed Itemset Mining, In Proceedings of SDM (2002)

CSI 4352, Introduction to Data Mining
Chapters 6 & 7, Frequent Pattern Mining
Outline: Market Basket Problem, Apriori Algorithm, CHARM Algorithm, Advanced Frequent Pattern Mining, Advanced Association Rule Mining, Constraint-Based Association Mining

Frequent Pattern Mining

Definition
- A frequent pattern: a pattern (a set of items, sub-sequences, sub-structures, etc.) that occurs frequently in a data set

Motivation: finding inherent regularities in data
- e.g., What products were often purchased together?
- e.g., What are the subsequent purchases after buying a PC?
- e.g., What kinds of DNA sequences are sensitive to this new drug?
- e.g., Can we find web documents similar to my research?

Applications
- Market basket analysis, DNA sequence analysis, web log analysis

Why Frequent Pattern Mining?

Importance
- A frequent pattern is an intrinsic and important property of data sets
- A foundation for many essential data mining tasks:
  - Association, correlation, and causality analysis
  - Sequential and structural (sub-graph) pattern analysis
  - Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  - Classification: discriminative frequent pattern analysis
  - Cluster analysis: frequent pattern-based clustering
  - Data pre-processing: data reduction and compression
  - Data warehousing: iceberg cube computation

Sampling Approach

Motivation
- Problem: typically huge data size
- Mine a subset of the data to reduce the candidate search space
- Trades some degree of accuracy for efficiency

Process
(1) Select a set of random samples from the original database
(2) Mine frequent itemsets in the sample using Apriori
(3) Verify the discovered itemsets, together with the border of the frequent-itemset collection, against the full database

Reference
- Toivonen, H., Sampling Large Databases for Association Rules, In Proceedings of VLDB (1996)

Partitioning Approach

Motivation
- Problem: typically huge data size
- Partition the data to reduce the candidate search space

Process
(1) Partition the database and find the local frequent patterns of each partition
(2) Consolidate the global frequent patterns (sketched below)

Reference
- Savasere, A., Omiecinski, E. and Navathe, S., An Efficient Algorithm for Mining Association Rules in Large Databases, In Proceedings of VLDB (1995)
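A minimal sketch of the two-phase partitioning idea. The local miner below is a brute-force stand-in (in practice it would be Apriori or similar), and the local threshold is scaled to the partition size; all names are mine:

```python
# Sketch: two-phase partitioning. Any itemset frequent globally must be
# frequent in at least one partition, so the union of local results is a
# complete candidate set; one extra scan verifies global support.
from itertools import combinations

def mine_frequent(transactions, min_sup):
    """Stand-in local miner: exhaustive, only suitable for tiny partitions."""
    items = {i for t in transactions for i in t}
    freq = set()
    for k in range(1, len(items) + 1):
        for c in combinations(sorted(items), k):
            if sum(set(c) <= t for t in transactions) >= min_sup:
                freq.add(frozenset(c))
    return freq

def partitioned_mining(db, min_sup_ratio, n_parts=2):
    size = (len(db) + n_parts - 1) // n_parts
    parts = [db[i:i + size] for i in range(0, len(db), size)]
    # Phase 1: local frequent itemsets, with a proportional local threshold
    candidates = set()
    for p in parts:
        candidates |= mine_frequent(p, max(1, int(min_sup_ratio * len(p))))
    # Phase 2: one full scan to verify global support
    return {c for c in candidates
            if sum(c <= t for t in db) >= min_sup_ratio * len(db)}

db = [set("ACD"), set("BCE"), set("ABCE"), set("BE")]
print(partitioned_mining(db, 0.5))   # same frequent itemsets as the Apriori example
```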

Hashing Approach

Motivation
- Problem: a very large number of candidates is generated
- The early iterations (e.g., size-2 candidate generation) dominate the total execution cost
- Hash itemsets to reduce the number of candidates

Process (a sketch follows after the next slide)
(1) Hash itemsets into the buckets of a hash table
(2) A k-itemset whose hash bucket count is below the minimum support cannot be frequent, and is removed

Reference
- Park, J.S., Chen, M.S. and Yu, P.S., An Effective Hash-Based Algorithm for Mining Association Rules, In Proceedings of SIGMOD (1995)

Pattern Growth Approach

Motivation
- Problem: a very large number of candidates is generated
- Find frequent itemsets without candidate generation
- Grow short patterns into long ones, using only locally frequent items
- Depth-first search (Apriori: breadth-first, level-wise search)

Example
- If abc is a frequent pattern and d is a frequent item in the abc-projected database, then abcd is a frequent pattern

Reference
- Han, J., Pei, J. and Yin, Y., Mining Frequent Patterns without Candidate Generation, In Proceedings of SIGMOD (2000)
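As referenced above, a sketch of the bucket-counting idea behind the hashing approach (a simplified variant; the bucket count and hash function here are arbitrary illustrative choices):

```python
# Sketch: while scanning transactions, hash every pair into a small table;
# a pair can only be frequent if its bucket total reaches min_sup, so
# low-count buckets prune size-2 candidates before exact counting.
from itertools import combinations

def hashed_pair_filter(transactions, min_sup, n_buckets=7):
    bucket = [0] * n_buckets
    h = lambda pair: hash(frozenset(pair)) % n_buckets   # consistent within one run
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            bucket[h(pair)] += 1
    all_pairs = {frozenset(p) for t in transactions for p in combinations(sorted(t), 2)}
    # Surviving candidates: a superset of the truly frequent pairs
    return {p for p in all_pairs if bucket[h(p)] >= min_sup}

db = [set("ACD"), set("BCE"), set("ABCE"), set("BE")]
print(hashed_pair_filter(db, 2))
```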

FP (Frequent Pattern) Tree

FP-Tree Construction
- Scan the DB once to find all frequent 1-itemsets
- Sort the frequent items in descending order of support: the f-list
- Scan the DB again to construct the FP-tree

Example (min_support = 3)

TID  Items bought               Ordered frequent items
100  f, a, c, d, g, i, m, p     f, c, a, m, p
200  a, b, c, f, l, m, o        f, c, a, b, m
300  b, f, h, j, o, w           f, b
400  b, c, k, s, p              c, b, p
500  a, f, c, e, l, p, m, n     f, c, a, m, p

Header table (each entry links to its nodes in the tree): f 4, c 4, a 3, b 3, m 3, p 3

FP-tree:
{}
+- f:4
|  +- c:3
|  |  +- a:3
|  |     +- m:2
|  |     |  +- p:2
|  |     +- b:1
|  |        +- m:1
|  +- b:1
+- c:1
   +- b:1
      +- p:1

Benefits of the FP-Tree Structure

Compactness
- Removes irrelevant (infrequent) items
- Shares common prefixes among patterns
- Orders items by descending support (the more frequently an item occurs, the more likely it is to be shared)
- Never larger than the original database

Completeness
- Preserves complete information for frequent pattern mining
- Never breaks any long pattern
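A runnable sketch of the construction on the example database (class and field names are mine; items with equal support are tie-broken arbitrarily, so the tree may differ cosmetically from the figure):

```python
# Sketch: building the FP-tree for the example DB (min_support = 3).
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fptree(transactions, min_sup):
    counts = Counter(i for t in transactions for i in t)
    flist = [i for i, c in counts.most_common() if c >= min_sup]  # descending support
    root, header = Node(None, None), defaultdict(list)            # header: item -> nodes
    for t in transactions:
        node = root
        # Insert the transaction's frequent items in f-list order, sharing prefixes
        for item in sorted((i for i in t if i in flist), key=flist.index):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, flist

db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
root, header, flist = build_fptree(db, 3)
print(flist)   # e.g. ['f', 'c', 'a', 'b', 'm', 'p'] (tie order may vary)
print([(n.count, n.parent.item) for n in header["m"]])  # m:2 under a, m:1 under b
```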

Conditional Pattern Bases

Conditional Pattern Base Construction
- Traverse the FP-tree by following the links of each frequent item p
- Accumulate all prefix paths of p to form p's conditional pattern base

Example (from the FP-tree above)

Item  Conditional pattern base
f     (empty)
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1

Conditional FP-Trees

Conditional FP-Tree Construction
- For each pattern base, accumulate the count of each item
- Construct the conditional FP-tree from the frequent items of the pattern base

Example
- m's conditional pattern base: fca:2, fcab:1
- m-conditional FP-tree: {} - f:3 - c:3 - a:3 (b is dropped: count 1 < 3)
- All frequent patterns containing m: m, fm, cm, am, fcm, fam, cam, fcam

Algorithm of Pattern Growth

Mining Algorithm
(1) Construct the FP-tree
(2) For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
(3) Repeat (2) recursively on each newly created conditional FP-tree, until the resulting FP-tree is empty or contains only a single path
(4) A single path generates all the combinations of its sub-paths, each of which is a frequent pattern

Advanced Techniques
- To fit an FP-tree in memory, partition the database into a set of projected databases
- Mine the FP-tree of each projected database

Mining Frequent Maximal Patterns

Example (min_support = 2)

Tid  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F

1st round: A, B, C, D, E
2nd round: AB, AC, AD, AE and the potential maximal pattern ABCDE;
           BC, BD, BE, BCDE; CD, CE, CDE; DE
3rd round: ACD

Reference
- Bayardo, R., Efficiently Mining Long Patterns from Databases, In Proceedings of SIGMOD (1998)
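The recursion is easier to see on projected (conditional) databases than on the tree itself; below is a minimal pattern-growth sketch in that style, equivalent in spirit to FP-growth but not Han et al.'s exact tree-based implementation (all names are mine):

```python
# Sketch: pattern growth via projected databases. For each locally frequent
# item, recurse on its conditional database, growing the pattern one item
# per level; a fixed item order prevents generating the same pattern twice.
from collections import Counter

def pattern_growth(db, min_sup, suffix=frozenset()):
    """Return {frozenset(pattern): support}; db is a list of item sets."""
    counts = Counter(i for t in db for i in t)
    frequent = [i for i, c in counts.items() if c >= min_sup]
    order = {i: k for k, i in enumerate(sorted(frequent, key=lambda i: counts[i]))}
    results = {}
    for item in frequent:
        pattern = suffix | {item}
        results[pattern] = counts[item]
        # Conditional database: transactions containing `item`, restricted to
        # frequent items ranked below `item`
        cond = [{j for j in t if j in order and order[j] < order[item]}
                for t in db if item in t]
        cond = [t for t in cond if t]
        results.update(pattern_growth(cond, min_sup, pattern))
    return results

db = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]
for p, c in sorted(pattern_growth(db, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(p), c)   # same frequent itemsets as the Apriori example
```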

CSI 4352, Introduction to Data Mining
Chapters 6 & 7, Frequent Pattern Mining
Outline: Market Basket Problem, Apriori Algorithm, CHARM Algorithm, Advanced Frequent Pattern Mining, Advanced Association Rule Mining, Constraint-Based Association Mining

Mining Multi-Level Association Rules

Motivation
- Items often form hierarchies
- Flexible support settings: items at a lower level are expected to have lower support

Example
- Uniform support: Level 1, min_sup = 5%: Milk [support = 10%]; Level 2, min_sup = 5%: 2% Milk [support = 6%], Skim Milk [support = 4%]
- Reduced support: Level 1, min_sup = 5%; Level 2, min_sup = 3%
- With uniform support, Skim Milk (4%) is missed at level 2; lowering the level-2 threshold to 3% retains it

Redundant Rule Filtering

Motivation
- Some rules may be redundant due to ancestor relationships between items

Example
- { milk } → { bread }  [support = 10%, confidence = 70%]
- { 2% milk } → { bread }  [support = 6%, confidence = 72%]
- A rule is redundant if its support and confidence are close to the expected values derived from the rule's ancestor

Mining Multidimensional Association Rules

Single-Dimensional Association Rules
- buys("milk") → buys("bread")

Multidimensional Association Rules: rules with two or more dimensions (predicates)
- Inter-dimensional association rules (no repeated predicate):
  age("18-25") ∧ occupation("student") → buys("coke")
- Hybrid-dimensional association rules (repeated predicate):
  age("18-25") ∧ buys("popcorn") → buys("coke")

Attributes in Association Rules
- Categorical attributes
- Quantitative attributes → discretization

Quantitative Rule Mining

Motivation
- The ranges of numeric attributes can be adjusted dynamically to maximize confidence

Example
- age("32-38") ∧ income("40K-50K") → buys("HDTV")
- Binning partitions each numeric range; adjacent ranges are then grouped
- Combining grouping and binning yields a new rule: age("34-35") ∧ income("30K-50K") → buys("HDTV")

CSI 4352, Introduction to Data Mining
Chapters 6 & 7, Frequent Pattern Mining
Outline: Market Basket Problem, Apriori Algorithm, CHARM Algorithm, Advanced Frequent Pattern Mining, Advanced Association Rule Mining, Constraint-Based Association Mining

Constraint-Based Mining

Motivation
- Finding all the patterns (association rules) in a database? Too many, too diverse
- Users can give directions (constraints) on what to mine

Properties
- User flexibility: users can provide constraints on what is to be mined
- System optimization: constraints reduce the search space, making mining more efficient

Constraint Types
- Knowledge type constraints: association, classification, etc.
- Data constraints: select data having specific values using SQL-like queries, e.g., sales in Waco in September
- Dimension/level constraints: select specific dimensions or levels of the concept hierarchies
- Interestingness constraints: use interestingness measures, e.g., support, confidence, correlation
- Rule constraints: specify the form of rules (or meta-rules) to be mined

Meta-Rule Constraints

Definition
- Rule templates specifying the maximum or minimum number of predicates in a rule, or relationships among attributes, attribute values, or aggregates

Meta-Rule-Guided Mining
- Find rules relating customer attributes (age, address, credit rate) to the item purchased:
  P1(X) ∧ P2(Y) → buys("coke")
- Sample result: age("18-25") ∧ income("30K-40K") → buys("coke")

Rule Constraint Types
- Anti-monotonic constraints: once a constraint c is violated, further mining along that branch can be terminated
- Monotonic constraints: once a constraint c is satisfied, further checking of c is redundant
- Succinct constraints: the itemsets satisfying a constraint c can be generated directly
- Convertible constraints: a constraint c that is neither monotonic nor anti-monotonic, but becomes one of them if the items are properly ordered

Anti-Monotonicity in Constraints

Definition
- A constraint c is anti-monotonic if, whenever a pattern satisfies c, all of its sub-patterns satisfy c too
- Equivalently, if an itemset S violates c, so does any of its supersets

Examples
- count(S) < 3: anti-monotonic
- count(S) ≥ 4: not anti-monotonic
- sum(S.price) ≤ 100: anti-monotonic
- sum(S.price) ≥ 150: not anti-monotonic
- sum(S.profit) ≥ 80: not anti-monotonic
- support(S) ≥ 2: anti-monotonic (the Apriori property itself)

Monotonicity in Constraints

Definition
- A constraint c is monotonic if, whenever a pattern satisfies c, all of its super-patterns satisfy c too
- Equivalently, if an itemset S satisfies c, so does any of its supersets

Examples
- count(S) ≥ 2: monotonic
- sum(S.price) ≤ 100: not monotonic
- sum(S.price) ≥ 150: monotonic
- sum(S.profit) ≥ 100: not monotonic
- min(S.price) ≤ 80: monotonic
- min(S.price) > 70: not monotonic

TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

Item  Price  Profit
a     100    40
b     50     0
c     60     -20
d     80     10
e     100    -30
f     70     30
g     95     20
h     100    -10

Succinctness in Constraints

Definition
- A constraint c is succinct if the itemsets satisfying c can be generated before support counting
- All, and only, the itemsets satisfying c can be enumerated directly

Examples
- min(S.price) < 80: succinct
- sum(S.price) ≥ 150: not succinct

TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

Convertible Constraints

Definition
- A constraint c is convertible if c is neither anti-monotonic nor monotonic, but becomes anti-monotonic or monotonic when the items are properly ordered

Example
- avg(S.price) > 80: neither anti-monotonic nor monotonic
- If items are in value-descending order <a, e, h, g, d, f, c, b>, the constraint is anti-monotonic (the running average never increases)
- If items are in value-ascending order <b, c, f, d, g, a, e, h>, the constraint is monotonic (the running average never decreases)
- Since both orders work, the constraint is strongly convertible

Item  Price  Profit
a     100    40
b     50     0
c     60     -20
d     80     10
e     100    -30
f     70     30
g     95     20
h     100    -10
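A quick check of the descending-order claim on the price table above (a sketch; the point is that the running average along the order's prefixes never increases, so once avg(S.price) > 80 is violated it stays violated):

```python
# Sketch: avg(S.price) is non-increasing along prefixes of the
# price-descending order <a, e, h, g, d, f, c, b>, which is what makes
# avg(S.price) > 80 behave anti-monotonically under this ordering.
price = {"a": 100, "b": 50, "c": 60, "d": 80, "e": 100, "f": 70, "g": 95, "h": 100}
order = sorted(price, key=price.get, reverse=True)   # value-descending (ties arbitrary)

prefix, total = [], 0
for item in order:
    prefix.append(item)
    total += price[item]
    avg = total / len(prefix)
    print(prefix, f"avg = {avg:.1f}",
          "satisfies avg > 80" if avg > 80 else "violated -> prune this branch")
```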

Anti-Monotonic Constraints in Apriori

Handling Anti-Monotonic Constraints
- Apriori-style pruning can be applied to the constraint as well
- Example: sum(S.price) ≤ 5 with sup_min = 2 (taking each item's price to be its item ID)

T-ID  Items
10    1, 3, 4
20    2, 3, 5
30    1, 2, 3, 5
40    2, 5

C1: {1} 2, {2} 3, {3} 3, {4} 1, {5} 3
L1: {1} 2, {2} 3, {3} 3, {5} 3
C2: {1,2} 1, {1,3} 2, {2,3} 2  (pairs such as {1,5}, {2,5}, {3,5} are pruned by the constraint)
L2: {1,3} 2, {2,3} 2

Monotonic Constraints in Apriori

Handling Monotonic Constraints
- Apriori-style pruning cannot be applied to the constraint (a violating itemset may still have satisfying supersets)
- Example: sum(S.price) ≥ 3 with sup_min = 2

T-ID  Items
10    1, 3, 4
20    1, 2, 3
30    1, 2, 3, 5
40    2, 5

C1: {1} 3, {2} 3, {3} 3, {4} 1, {5} 2
L1: {1} 3, {2} 3, {3} 3, {5} 2
C2: {1,2} 2, {1,3} 3, {1,5} 1, {2,3} 2, {2,5} 2, {3,5} 1
L2: {1,2} 2, {1,3} 3, {2,3} 2, {2,5} 2
C3: {1,2,3} 2
L3: {1,2,3} 2
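A sketch of pushing the anti-monotone constraint into the Apriori loop, reproducing the first example above (assuming, as in that example, that each item's price equals its ID; helper names are mine):

```python
# Sketch: Apriori with an anti-monotone constraint (sum of item prices <= 5)
# checked at candidate-generation time, so violating itemsets are never counted.
from itertools import combinations

def apriori_constrained(db, min_sup, constraint):
    count = lambda c: sum(c <= t for t in db)
    items = {i for t in db for i in t}
    L = {frozenset([i]) for i in items
         if count(frozenset([i])) >= min_sup and constraint(frozenset([i]))}
    result, k = {s: count(s) for s in L}, 2
    while L:
        C = {a | b for a in L for b in L if len(a | b) == k}
        C = {c for c in C if constraint(c)                      # push the constraint
             and all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = {c for c in C if count(c) >= min_sup}
        result.update({s: count(s) for s in L})
        k += 1
    return result

db = [frozenset({1, 3, 4}), frozenset({2, 3, 5}),
      frozenset({1, 2, 3, 5}), frozenset({2, 5})]
# Price of each item = its ID, so the constraint is just sum(S) <= 5.
print(apriori_constrained(db, 2, lambda s: sum(s) <= 5))
# -> {1}:2, {2}:3, {3}:3, {5}:3, {1,3}:2, {2,3}:2, matching the example above
```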

Examples of Constraints

Constraint                        Anti-Monotone   Monotone      Succinct
v ∈ S                             no              yes           yes
S ⊇ V                             no              yes           yes
S ⊆ V                             yes             no            yes
min(S) ≤ v                        no              yes           yes
min(S) ≥ v                        yes             no            yes
max(S) ≤ v                        yes             no            yes
max(S) ≥ v                        no              yes           yes
count(S) ≤ v                      yes             no            weakly
count(S) ≥ v                      no              yes           weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0)        yes             no            no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)        no              yes           no
range(S) ≤ v                      yes             no            no
range(S) ≥ v                      no              yes           no
avg(S) θ v, θ ∈ {=, ≤, ≥}         convertible     convertible   no
support(S) ≥ ξ                    yes             no            no
support(S) ≤ ξ                    no              yes           no

Classification of Constraints

[Diagram: classification of constraints. Convertible anti-monotone constraints generalize anti-monotone ones, and convertible monotone constraints generalize monotone ones; strongly convertible constraints lie in their intersection; succinct constraints form their own class; the remaining constraints are inconvertible.]

Questions?

Lecture slides are available on the course website: www.ecs.baylor.edu/faculty/cho/4352