CS 484 Data Mining. Association Rule Mining 2


Review: Reducing Number of Candidates

Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.

The Apriori principle holds due to the following property of the support measure:

    ∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

The support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.
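As a quick illustration, the anti-monotone property can be checked mechanically on a toy basket (the same five transactions used in the candidate-generation slides below); the `support` helper here is just a sketch, not library code.

```python
from itertools import combinations

# Toy market-basket data: the five transactions from the slides below.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Anti-monotonicity: for every X that is a subset of Y, s(X) >= s(Y).
Y = {"Bread", "Milk", "Diaper"}
for r in range(1, len(Y)):
    for X in combinations(Y, r):
        assert support(X) >= support(Y)

print(support({"Bread", "Milk"}))            # 0.6
print(support({"Bread", "Milk", "Diaper"}))  # 0.4
```

Shrinking an itemset can only keep or grow the set of transactions that contain it, which is exactly why the assertion never fires.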

Candidate Generation

Three basic approaches:
- Brute-force method
- F(k-1) x F(1) method
- F(k-1) x F(k-1) method

The next three slides demonstrate how each method generates candidate 3-itemsets.

Brute-Force Method

TID  Items
 1   Bread, Milk
 2   Bread, Diaper, Beer, Eggs
 3   Milk, Diaper, Beer, Coke
 4   Bread, Milk, Diaper, Beer
 5   Bread, Milk, Diaper, Coke

Min support count = 3 (minsup = 60%)

F(k-1) x F(1) Method

(Same transaction table and support threshold as above.)

F(k-1) x F(k-1) Method

(Same transaction table and support threshold as above.) Only merge a pair of frequent (k-1)-itemsets if their first k-2 items are identical!
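The F(k-1) x F(k-1) merge rule can be sketched in a few lines; the function name and the tuple representation are illustrative, and the frequent 2-itemsets below are those from the transaction table at min support count 3.

```python
def merge_candidates(freq_kminus1):
    """F(k-1) x F(k-1) candidate generation: merge two frequent
    (k-1)-itemsets only if their first k-2 items are identical.
    Itemsets are sorted tuples; the list is scanned in sorted order,
    so once prefixes diverge no later pair can match."""
    freq = sorted(freq_kminus1)
    candidates = []
    for i in range(len(freq)):
        for j in range(i + 1, len(freq)):
            a, b = freq[i], freq[j]
            if a[:-1] != b[:-1]:     # first k-2 items differ: stop scanning
                break
            candidates.append(a + (b[-1],))
    return candidates

# Frequent 2-itemsets of the table above (min support count = 3)
F2 = [("Beer", "Diaper"), ("Bread", "Diaper"),
      ("Bread", "Milk"), ("Diaper", "Milk")]
print(merge_candidates(F2))   # [('Bread', 'Diaper', 'Milk')]
```

Only ("Bread", "Diaper") and ("Bread", "Milk") share their first item, so a single candidate 3-itemset survives; a real implementation would still have to verify that every 2-subset of the candidate is frequent before counting it.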

Candidate Pruning

[Figure: itemset lattice over items A-E, from the null set at the top through the 2-itemsets (AB, AC, ..., DE), 3-itemsets, and 4-itemsets down to ABCDE. One 2-itemset is found to be infrequent, and all of its supersets are pruned from the candidate space.]

Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement.

If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC→D, ABD→C, ACD→B, BCD→A, A→BCD, B→ACD, C→ABD, D→ABC, AB→CD, AC→BD, AD→BC, BC→AD, BD→AC, CD→AB

If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L).

Rule Generation

How to efficiently generate rules from frequent itemsets? In general, confidence does not have an anti-monotone property: c(ABC→D) can be larger or smaller than c(AB→D). But the confidence of rules generated from the same itemset does have an anti-monotone property. E.g., for L = {A,B,C,D}:

    c(ABC→D) ≥ c(AB→CD) ≥ c(A→BCD)

Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule.
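This ordering can be verified numerically on the toy transactions from the earlier slides; `support` and `confidence` are sketch helpers, not library functions.

```python
# Toy market-basket data from the candidate-generation slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """c(lhs -> rhs) = s(lhs ∪ rhs) / s(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

# Two rules from the same frequent itemset {Bread, Milk, Diaper}:
# moving an item from the LHS to the RHS can only lower confidence,
# because the numerator s(L) stays fixed while the denominator grows.
c1 = confidence({"Bread", "Milk"}, {"Diaper"})   # 0.4 / 0.6 ≈ 0.667
c2 = confidence({"Bread"}, {"Milk", "Diaper"})   # 0.4 / 0.8 = 0.5
assert c1 >= c2
print(round(c1, 3), round(c2, 3))
```

The key observation is that both rules share the numerator s({Bread, Milk, Diaper}); only the denominator changes, and it can only grow as the LHS shrinks.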

Theorem: If the rule X → Y − X does not satisfy the confidence threshold, then any rule X′ → Y − X′, where X′ is a subset of X, does not satisfy the confidence threshold either.

Rule Generation for Apriori Algorithm

[Figure: lattice of the rules generated from the frequent itemset {A,B,C,D}, from ABCD⇒{} at the top, through BCD⇒A, ACD⇒B, ABD⇒C, ABC⇒D and CD⇒AB, BD⇒AC, BC⇒AD, AD⇒BC, AC⇒BD, AB⇒CD, down to D⇒ABC, C⇒ABD, B⇒ACD, A⇒BCD at the bottom. A low-confidence rule is marked, and all rules below it in the lattice are pruned without being checked.]

Rule Generation for Apriori Algorithm

A candidate rule is generated by merging two rules that share the same prefix in the rule consequent. For example, join(CD⇒AB, BD⇒AC) produces the candidate rule D⇒ABC. Prune the candidate D⇒ABC if the rule AD⇒BC, whose consequent is a subset of ABC, does not have high confidence (by the anti-monotonicity of confidence w.r.t. the consequent).
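The whole level-wise rule-generation loop can be sketched as follows; `generate_rules` and its representation choices are illustrative, and the confidence computation reuses the toy transactions from the earlier slides.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    return support(set(lhs) | set(rhs)) / support(lhs)

def generate_rules(L, minconf):
    """Level-wise rule generation from a frequent itemset L: consequents
    grow one item at a time, and a consequent is extended only if its
    rule met the confidence threshold (anti-monotone pruning)."""
    L = frozenset(L)
    rules, level = [], [frozenset([i]) for i in L]
    while level:
        survivors = []
        for H in level:
            lhs = L - H
            if lhs and confidence(lhs, H) >= minconf:
                rules.append((tuple(sorted(lhs)), tuple(sorted(H))))
                survivors.append(H)
        # Merge surviving consequents pairwise (candidate-generation style)
        # to build the next level of larger consequents.
        level = list({a | b for a, b in combinations(survivors, 2)
                      if len(a | b) == len(a) + 1})
    return rules

rules = generate_rules({"Bread", "Milk", "Diaper"}, minconf=0.6)
print(len(rules))   # 3: only the single-item consequents survive
```

At minconf = 0.6, the three rules with one-item consequents (confidence 2/3 each) pass, while every two-item consequent would have confidence 0.5, so the second level is pruned entirely and the loop stops.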

Reducing Number of Comparisons

Candidate counting: scan the database of transactions to determine the support of each candidate itemset. To reduce the number of comparisons, store the candidates in a hash structure. Instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets.

Subset Operation (Enumeration)

Given a transaction t = {1, 2, 3, 5, 6}, what are the possible subsets of size 3?

[Figure: prefix enumeration tree. Level 1 fixes the first item (1, 2, or 3), level 2 fixes the second, and level 3 lists the ten 3-item subsets: 123, 125, 126, 135, 136, 156, 235, 236, 256, 356.]
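The enumeration tree produces the subsets in exactly the lexicographic order that `itertools.combinations` yields, which makes it easy to sanity-check:

```python
from itertools import combinations

# All 3-item subsets of transaction t = {1, 2, 3, 5, 6}, in the same
# prefix order the enumeration tree produces.
t = (1, 2, 3, 5, 6)
subsets3 = list(combinations(t, 3))
print(len(subsets3))  # C(5,3) = 10
print(subsets3)
# [(1, 2, 3), (1, 2, 5), (1, 2, 6), (1, 3, 5), (1, 3, 6),
#  (1, 5, 6), (2, 3, 5), (2, 3, 6), (2, 5, 6), (3, 5, 6)]
```

A transaction of width w has C(w, 3) subsets of size 3, which is why wide transactions make support counting expensive without a hash structure.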

Generate Hash Tree

Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

You need:
- A hash function (here, items 1, 4, 7 hash to one branch; 2, 5, 8 to another; and 3, 6, 9 to the third)
- A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)

[Figure: the resulting hash tree with the 15 candidates distributed across its leaves.]
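A minimal hash-tree construction can be sketched as below. This is an illustrative implementation, not the exact tree from the slide (the order in which leaves split may differ); `Node`, `insert`, and `MAX_LEAF` are names chosen for the sketch.

```python
class Node:
    def __init__(self):
        self.children = {}   # bucket -> Node (when internal)
        self.itemsets = []   # stored candidates (when leaf)
        self.is_leaf = True

MAX_LEAF = 3                 # max itemsets per leaf before splitting

def h(item):
    """Slide's hash function: items 1,4,7 / 2,5,8 / 3,6,9 share a bucket."""
    return item % 3

def insert(node, itemset, depth=0):
    if not node.is_leaf:
        child = node.children.setdefault(h(itemset[depth]), Node())
        insert(child, itemset, depth + 1)
        return
    node.itemsets.append(itemset)
    if len(node.itemsets) > MAX_LEAF and depth < len(itemset):
        node.is_leaf = False             # overflow: split this leaf
        for s in node.itemsets:          # redistribute by the next item
            child = node.children.setdefault(h(s[depth]), Node())
            insert(child, s, depth + 1)
        node.itemsets = []

def count(node):
    """Total number of candidates stored below `node`."""
    if node.is_leaf:
        return len(node.itemsets)
    return sum(count(c) for c in node.children.values())

candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
root = Node()
for c in candidates:
    insert(root, c)
print(count(root))          # all 15 candidates are stored
print(len(root.children))   # 3 top-level branches, one per hash bucket
```

The `depth < len(itemset)` guard stops splitting once every item of the itemsets in a leaf has been hashed, so a leaf full of candidates that collide at every level simply grows past `MAX_LEAF` instead of recursing forever.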

Association Rule Discovery: Hash Tree

[Figure, shown across three slides: the candidate hash tree for the 15 itemsets, highlighting in turn the branch taken when hashing on items 1, 4, or 7; on items 2, 5, or 8; and on items 3, 6, or 9.]

Subset Operation Using Hash Tree

[Figure, shown across three slides: matching transaction t = {1, 2, 3, 5, 6} against the candidate hash tree. At level 1 the transaction is split into the prefixes 1 + {2 3 5 6}, 2 + {3 5 6}, and 3 + {5 6}; each prefix is hashed down the tree, split further at levels 2 and 3, and compared only against the candidates in the leaves it reaches.]

In the end, the transaction is matched against only 9 of the 15 candidates.

Factors Affecting Complexity

Choice of minimum support threshold: lowering the support threshold results in more frequent itemsets; this may increase the number of candidates and the max length of frequent itemsets.

Dimensionality (number of items) of the data set: more space is needed to store the support count of each item; if the number of frequent items also increases, both computation and I/O costs may increase.

Size of database: since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions.

Average transaction width: transaction width increases with denser data sets; this may increase the max length of frequent itemsets and the traversals of the hash tree (the number of subsets in a transaction increases with its width).