Association Analysis Part 2. FP Growth (Pei et al 2000)


Association Analysis Part 2
Sanjay Ranka, Professor, Computer and Information Science and Engineering, University of Florida

FP Growth (Pei et al 2000)
Use a compressed representation of the database using an FP-tree.
Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine frequent itemsets.

Data Mining, Sanjay Ranka, Fall 23

FP-Tree Construction

Transaction database:
TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

After reading TID=1:
null
  A:1
    B:1

After reading TID=2:
null
  A:1
    B:1
  B:1
    C:1
      D:1

After reading all ten transactions:
null
  A:7
    B:5
      C:3
        D:1
      D:1
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      D:1
      E:1

Header table: one entry per item (A, B, C, D, E), each with a pointer to the chain of tree nodes that hold that item.
Pointers are used to assist frequent itemset generation.
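The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `FPNode` and `build_fp_tree` are hypothetical, items are inserted in lexicographic order as on the slide (FP-growth implementations usually order items by descending support), and TIDs 1 and 10 are assumed for the first and last transactions.

```python
class FPNode:
    """One node of the prefix tree: an item, a count, and child nodes."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions):
    """Insert each transaction as a path, sharing common prefixes."""
    root = FPNode()   # the "null" root
    header = {}       # item -> list of tree nodes holding that item
    for items in transactions:
        node = root
        for item in sorted(items):   # fixed lexicographic item order (assumption)
            child = node.children.get(item)
            if child is None:
                child = FPNode(item)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header


transactions = [
    {"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"}, {"A", "D", "E"},
    {"A", "B", "C"}, {"A", "B", "C", "D"}, {"B", "C"}, {"A", "B", "C"},
    {"A", "B", "D"}, {"B", "C", "E"},
]
root, header = build_fp_tree(transactions)
print({item: child.count for item, child in root.children.items()})
```

The `header` dictionary plays the role of the header table: following `header["D"]` visits every node labelled D without traversing the whole tree.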

FP-Tree Growth
[Figure: the prefix paths of the FP-tree that end in D]
Conditional pattern base for D:
P = {(A:1,B:1,C:1), (A:1,B:1), (A:1,C:1), (A:1), (B:1,C:1)}
Recursively apply FP-growth on P.
Frequent itemsets found (with sup > 1): AD, BD, CD, ABD, ACD, BCD

Tree Projection (Agrawal et al 1999)
Set enumeration tree:
null
A  B  C  D  E
AB  AC  AD  AE  BC  BD  BE  CD  CE  DE
ABC  ABD  ABE  ACD  ACE  ADE  BCD  BCE  BDE  CDE
ABCD  ABCE  ABDE  ACDE  BCDE
ABCDE
Possible extensions of node A: E(A) = {B,C,D,E}
Possible extensions of node ABC: E(ABC) = {D,E}
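As a check on the conditional pattern base for D, the sketch below derives it directly from the ten transactions and then counts every itemset ending in D. It is a brute-force stand-in for the recursive FP-growth step, not the recursion itself; the function names are hypothetical and TIDs 1 and 10 are assumed for the first and last transactions.

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"}, {"A", "D", "E"},
    {"A", "B", "C"}, {"A", "B", "C", "D"}, {"B", "C"}, {"A", "B", "C"},
    {"A", "B", "D"}, {"B", "C", "E"},
]


def conditional_pattern_base(transactions, item):
    """Prefix paths (items ordered before `item`) with their counts."""
    base = Counter()
    for t in transactions:
        if item in t:
            prefix = tuple(sorted(i for i in t if i < item))
            base[prefix] += 1
    return base


def frequent_with_suffix(base, item, min_count):
    """Count every subset of each prefix path, extended with `item`."""
    counts = Counter()
    for prefix, n in base.items():
        for k in range(1, len(prefix) + 1):
            for combo in combinations(prefix, k):
                counts[combo + (item,)] += n
    return {"".join(s): c for s, c in counts.items() if c >= min_count}


base_d = conditional_pattern_base(transactions, "D")
freq_d = frequent_with_suffix(base_d, "D", min_count=2)   # sup > 1
print(sorted(freq_d.items()))
```

In this sketch ABD also reaches a count of 2 (transactions 6 and 9), so it appears alongside AD, BD, CD, ACD and BCD.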

Tree Projection
Items are listed in lexicographic order.
Each node P stores the following information:
Itemset for node P
List of possible lexicographic extensions of P: E(P)
Pointer to projected database of its ancestor node
Bit vector containing information about which transactions in the projected database contain the itemset

Projected Database

Original database:
TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

Projected database for node A:
TID  Items
1    {B}
2    {}
3    {C,D,E}
4    {D,E}
5    {B,C}
6    {B,C,D}
7    {}
8    {B,C}
9    {B,D}
10   {}

For each transaction T that contains A, the projected transaction at node A is T ∩ E(A); transactions that do not contain A (TIDs 2, 7, 10) project to the empty set.
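The projection rule can be written out as a short sketch. The helper names `extensions` and `project` are hypothetical, E(P) is taken to be all items after the last item of P in lexicographic order (the unpruned list of possible extensions), and TIDs 1 and 10 are assumed for the first and last transactions.

```python
ITEMS = ["A", "B", "C", "D", "E"]

transactions = {
    1: {"A", "B"}, 2: {"B", "C", "D"}, 3: {"A", "C", "D", "E"},
    4: {"A", "D", "E"}, 5: {"A", "B", "C"}, 6: {"A", "B", "C", "D"},
    7: {"B", "C"}, 8: {"A", "B", "C"}, 9: {"A", "B", "D"},
    10: {"B", "C", "E"},
}


def extensions(node):
    """E(P): items that follow the last item of P in lexicographic order."""
    last = max(node)
    return {i for i in ITEMS if i > last}


def project(transactions, node):
    """T ∩ E(P) for transactions containing P, the empty set otherwise."""
    ext = extensions(node)
    return {tid: (t & ext if node <= t else set())
            for tid, t in transactions.items()}


projected_a = project(transactions, {"A"})
for tid in sorted(projected_a):
    print(tid, sorted(projected_a[tid]))
```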

ECLAT (Zaki 2000)
For each item, store a list of transaction ids (tids).

Horizontal data layout:
TID  Items
1    A,B,E
2    B,C,D
3    C,E
4    A,C,D
5    A,B,C,D
6    A,E
7    A,B
8    A,B,C
9    A,C,D
10   B

Vertical data layout: one TID-list per item (A, B, C, D, E).

ECLAT
Determine support of any k-itemset by intersecting tid-lists of two of its (k-1)-subsets, for example A ∧ B → AB.
Traversal approaches: top-down, bottom-up and hybrid.
Advantage: very fast support counting.
Disadvantage: intermediate tid-lists may become too large for memory.
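A minimal sketch of the vertical layout and of support counting by tid-list intersection, using the horizontal table above. The function name `to_vertical` is hypothetical, and TIDs 1 and 10 are assumed for the first and last rows.

```python
horizontal = {
    1: "ABE", 2: "BCD", 3: "CE", 4: "ACD", 5: "ABCD",
    6: "AE", 7: "AB", 8: "ABC", 9: "ACD", 10: "B",
}


def to_vertical(horizontal):
    """Turn TID -> items into item -> set of TIDs."""
    vertical = {}
    for tid, items in horizontal.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical


tidlists = to_vertical(horizontal)

# Support of a 2-itemset: intersect the tid-lists of its two items.
tid_ab = tidlists["A"] & tidlists["B"]
tid_ac = tidlists["A"] & tidlists["C"]

# Support of a 3-itemset: intersect the tid-lists of two of its 2-subsets.
tid_abc = tid_ab & tid_ac

print(sorted(tid_ab), sorted(tid_abc))
```

The support of an itemset is simply the length of its tid-list, which is why counting is fast; the cost is that every intermediate tid-list has to be held in memory.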


Interestingness Measures
Association rule algorithms tend to produce too many rules; many of them are uninteresting or redundant.
A rule is redundant if {A,B,C} → {D} and {A,B} → {D} have the same support and confidence.
Interestingness measures can be used to prune or rank the derived patterns.
In the original formulation of association rules, support and confidence are the only measures used.

Application of Interestingness Measure
[Figure: knowledge discovery pipeline. Data → Selection → Selected Data → Preprocessing → Preprocessed Data → Mining → Patterns → Postprocessing → Knowledge, with interestingness measures applied to the patterns.]

Computing Interestingness Measure
Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X → Y:
        Y     ¬Y
X       f11   f10   f1+
¬X      f01   f00   f0+
        f+1   f+0   |T|

f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y

These counts can be used to apply various measures: support, confidence, lift, Gini, J-measure, etc.

Drawback of Confidence
[Table: contingency table of Tea / ¬Tea against Coffee / ¬Coffee]
Association rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9.
Although confidence is high, the rule is misleading: P(Coffee | ¬Tea) = 0.9375.
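The tea and coffee numbers can be reproduced from a contingency table. The cell counts below are an assumption: they are illustrative values chosen so that the three probabilities quoted above (0.75, 0.9, 0.9375) come out exactly, and the function name `rule_stats` is hypothetical.

```python
def rule_stats(f11, f10, f01, f00):
    """Confidence of X -> Y, P(Y), and P(Y | not X) from the four cell counts."""
    n = f11 + f10 + f01 + f00
    confidence = f11 / (f11 + f10)          # P(Y | X)
    p_y = (f11 + f01) / n                   # P(Y)
    conf_given_not_x = f01 / (f01 + f00)    # P(Y | not X)
    return confidence, p_y, conf_given_not_x


# X = Tea, Y = Coffee; illustrative counts (assumed, not taken from the slide).
conf, p_coffee, conf_no_tea = rule_stats(f11=15, f10=5, f01=75, f00=5)
print(conf, p_coffee, conf_no_tea)
```

Because P(Coffee | Tea) is below P(Coffee), knowing that a customer buys tea lowers the chance that they buy coffee, even though the rule's confidence looks high.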

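Other Measures

$\text{Lift} = \dfrac{P(Y \mid X)}{P(Y)}$

$\text{Interest} = \dfrac{P(X,Y)}{P(X)\,P(Y)}$

$PS = P(X,Y) - P(X)\,P(Y)$

$\phi\text{-coefficient} = \dfrac{P(X,Y) - P(X)\,P(Y)}{\sqrt{P(X)\,[1 - P(X)]\,P(Y)\,[1 - P(Y)]}}$

$IS = \dfrac{P(X,Y)}{\sqrt{P(X)\,P(Y)}}$

Other Measures
[Table: definitions of further measures]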
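The standard definitions of lift, interest, PS, the φ-coefficient and IS can be computed from the four contingency counts as sketched below. The function name `measures` and the example table are assumptions; the table is chosen so that X and Y are statistically independent.

```python
from math import sqrt


def measures(f11, f10, f01, f00):
    """Lift, interest, PS, phi and IS from a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    pxy = f11 / n
    px = (f11 + f10) / n
    py = (f11 + f01) / n
    return {
        "lift": (pxy / px) / py,                 # P(Y|X) / P(Y)
        "interest": pxy / (px * py),             # P(X,Y) / (P(X) P(Y))
        "PS": pxy - px * py,                     # Piatetsky-Shapiro
        "phi": (pxy - px * py) / sqrt(px * (1 - px) * py * (1 - py)),
        "IS": pxy / sqrt(px * py),               # cosine
    }


# P(X) = 0.5, P(Y) = 0.2, P(X,Y) = 0.1 = P(X) P(Y): independent by construction.
independent = measures(10, 40, 10, 40)
print(independent)
```

For an independent pair, lift and interest equal 1 while PS and φ equal 0; IS has no such fixed value under independence.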

Properties of a Good Measure
Piatetsky-Shapiro: three properties a good measure M should satisfy:
M(A,B) = 0 if A and B are statistically independent.
M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged.
M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged.

Lift and Interest
[Tables: two example contingency tables with their lift values]
Statistical independence: if P(X,Y) = P(X) P(Y), then Lift = 1.

Comparing Different Measures
10 examples of contingency tables:
[Table: cell counts f11, f10, f01, f00 for examples E1 to E10]
Rankings of contingency tables using various measures:
[Figure: ranking of E1 to E10 under each measure]

Property Under Variable Permutation

        B   ¬B                A   ¬A
A       p   q          B      p   r
¬A      r   s          ¬B     q   s

Does M(A,B) = M(B,A)?
Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc.
Asymmetric measures: confidence, conviction, Laplace, J-measure, etc.
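Variable permutation swaps the off-diagonal cells q and r. The sketch below checks the symmetry claim for two symmetric measures (lift, Jaccard) and one asymmetric measure (confidence); the table values and function names are illustrative assumptions.

```python
def confidence(p, q, r, s):
    return p / (p + q)                       # P(B | A)


def lift(p, q, r, s):
    n = p + q + r + s
    return (p / n) / (((p + q) / n) * ((p + r) / n))


def jaccard(p, q, r, s):
    return p / (p + q + r)


def permute(p, q, r, s):
    """Exchange the roles of A and B: the off-diagonal cells swap."""
    return p, r, q, s


table = (15, 5, 75, 5)   # illustrative counts (assumed)
swapped = permute(*table)

print(lift(*table), lift(*swapped))              # equal
print(jaccard(*table), jaccard(*swapped))        # equal
print(confidence(*table), confidence(*swapped))  # differ
```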

Property Under Row / Column Scaling
Grade-Gender example (Mosteller, 1968):
[Tables: two High / Low grade by Male / Female contingency tables, the second obtained by scaling the Male and Female columns of the first by different factors]
Mosteller: the underlying association should be independent of the relative number of male and female students in the samples.

Property Under Inversion Operation
[Figure: binary transaction vectors A to F over transactions 1 to N, illustrating inversion in panels (a), (b), (c)]

Example: φ-coefficient
The φ-coefficient is analogous to the correlation coefficient for continuous variables.
[Tables: two contingency tables, one the inversion of the other, each with φ = 0.5238]
The φ-coefficient is the same for both tables.

Property Under Null Addition

        B   ¬B                B   ¬B
A       p   q          A      p   q
¬A      r   s          ¬A     r   s + k

Invariant measures: support, cosine, Jaccard, etc.
Non-invariant measures: correlation, Gini, mutual information, odds ratio, etc.
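Both properties can be checked numerically. The counts below are an assumption: illustrative values chosen so that φ comes out at 0.5238 as stated above. The sketch compares a table with its inversion (f11 and f00 exchanged, f10 and f01 exchanged) and with a null-added table (k extra transactions that contain neither item).

```python
from math import sqrt


def phi(f11, f10, f01, f00):
    """phi-coefficient from the four cell counts."""
    num = f11 * f00 - f10 * f01
    den = sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return num / den


def jaccard(f11, f10, f01, f00):
    return f11 / (f11 + f10 + f01)


table = (60, 10, 10, 20)          # illustrative counts (assumed)
inverted = (20, 10, 10, 60)       # every 0 becomes 1 and every 1 becomes 0
null_added = (60, 10, 10, 120)    # k = 100 transactions with neither item

print(round(phi(*table), 4), round(phi(*inverted), 4))
print(round(phi(*null_added), 4))
print(jaccard(*table), jaccard(*null_added))
```

φ is unchanged by inversion but moves under null addition, while Jaccard ignores f00 entirely and therefore does not change when null transactions are added.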

Different Measures Have Different Properties

Symbol  Measure               P1    P2   P3   O1     O2   O3    O3'  O4
Φ       Correlation           Yes   Yes  Yes  Yes    No   Yes   Yes  No
λ       Lambda                Yes   No   No   Yes    No   No*   Yes  No
α       Odds ratio            Yes*  Yes  Yes  Yes    Yes  Yes*  Yes  No
Q       Yule's Q              Yes   Yes  Yes  Yes    Yes  Yes   Yes  No
Y       Yule's Y              Yes   Yes  Yes  Yes    Yes  Yes   Yes  No
κ       Cohen's               Yes   Yes  Yes  Yes    No   No    Yes  No
M       Mutual Information    Yes   Yes  Yes  Yes    No   No*   Yes  No
J       J-Measure             Yes   No   No   No     No   No    No   No
G       Gini Index            Yes   No   No   No     No   No*   Yes  No
s       Support               No    Yes  No   Yes    No   No    No   No
c       Confidence            No    Yes  No   Yes    No   No    No   Yes
L       Laplace               No    Yes  No   Yes    No   No    No   No
V       Conviction            No    Yes  No   Yes**  No   No    Yes  No
I       Interest              Yes*  Yes  Yes  Yes    No   No    No   No
IS      IS (cosine)           No    Yes  Yes  Yes    No   No    No   Yes
PS      Piatetsky-Shapiro's   Yes   Yes  Yes  Yes    No   Yes   Yes  No
F       Certainty factor      Yes   Yes  Yes  No     No   No    Yes  No
AV      Added value           Yes   Yes  Yes  No     No   No    No   No
S       Collective strength   No    Yes  Yes  Yes    No   Yes*  Yes  No
ζ       Jaccard               No    Yes  Yes  Yes    No   No    No   Yes
K       Klosgen's             Yes   Yes  Yes  No     No   No    No   No

Support-Based Pruning
Most association rule mining algorithms use the support measure to prune rules and itemsets.
Study the effect of support pruning on the correlation of itemsets:
Generate random contingency tables.
Compute support and pairwise correlation for each table.
Apply support-based pruning and examine the tables that are removed.

Effect of Support-Based Pruning
[Figure: histogram of correlation over all item pairs]

Effect of Support-Based Pruning
[Figures: histograms of correlation for the item pairs removed under three increasing minimum-support thresholds]
Support-based pruning eliminates mostly negatively correlated itemsets.
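The experiment can be imitated with a small simulation. Everything specific here is an assumption made for illustration: tables are drawn uniformly at random as four proportions summing to 1, the threshold is 0.05, the seed is fixed, and only the sign of the correlation (the sign of f11·f00 − f10·f01) is examined rather than its full distribution.

```python
import random


def random_table(rng):
    """Four random proportions (p11, p10, p01, p00) that sum to 1."""
    c1, c2, c3 = sorted(rng.random() for _ in range(3))
    return c1, c2 - c1, c3 - c2, 1.0 - c3


def negatively_correlated(p11, p10, p01, p00):
    """The phi-coefficient is negative exactly when p11*p00 < p10*p01."""
    return p11 * p00 - p10 * p01 < 0


rng = random.Random(0)                      # fixed seed for repeatability
tables = [random_table(rng) for _ in range(10000)]

min_support = 0.05                          # assumed threshold
pruned = [t for t in tables if t[0] < min_support]
kept = [t for t in tables if t[0] >= min_support]

frac_neg_pruned = sum(negatively_correlated(*t) for t in pruned) / len(pruned)
frac_neg_kept = sum(negatively_correlated(*t) for t in kept) / len(kept)
print(round(frac_neg_pruned, 3), round(frac_neg_kept, 3))
```

In this setup the pruned tables are negatively correlated far more often than the tables that survive, which is the pattern the slide describes.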


Effect of Support-Based Pruning
Investigate how support-based pruning affects other measures.
Steps:
Generate contingency tables.
Rank each table according to the different measures.
Compute the pair-wise correlation between the measures.

Effect of Support-Based Pruning: Without Support Pruning (All Pairs)
[Figure: matrix of pairwise correlations between the measures (Conviction, Odds ratio, Col Strength, Correlation, Interest, PS, CF, Yule Y, Reliability, Kappa, Klosgen, Yule Q, Confidence, Laplace, IS, Support, Jaccard, Lambda, Gini, J-measure, Mutual Info). Red cells indicate a correlation between the pair of measures above 0.85.]
[Figure: scatter plot between Correlation and Jaccard measure]


Effect of Support-Based Pruning: support restricted to a lower and an upper bound
[Figure: the same matrix of pairwise correlations between the measures, computed only on item pairs whose support lies between the two bounds; cells mark pairs of measures with correlation above 0.85]
[Figure: scatter plot between Correlation and Jaccard measure]

Effect of Support-Based Pruning: support restricted to a tighter upper bound
[Figure: the same matrix of pairwise correlations between the measures under the tighter support range; cells mark pairs of measures with correlation above 0.85]
[Figure: scatter plot between Correlation and Jaccard measure]


Subjective Interestingness Measure
Objective measure:
Rank patterns based on statistics computed from data, e.g., the 21 measures of association (support, confidence, Laplace, Gini, mutual information, Jaccard, etc.).
Subjective measure:
Rank patterns according to the user's interpretation.
A pattern is subjectively interesting if it contradicts the expectation of a user (Silberschatz & Tuzhilin).
A pattern is subjectively interesting if it is actionable (Silberschatz & Tuzhilin).

Interestingness vs. Unexpectedness
Need to model the expectation of users (domain knowledge).
[Figure legend: + pattern expected to be frequent; - pattern expected to be infrequent; pattern found to be frequent; pattern found to be infrequent. Patterns whose expectation and finding agree are expected patterns; patterns where they disagree are unexpected patterns.]
Need to combine the expectation of users with evidence from data (i.e., extracted patterns).

Interestingness vs. Unexpectedness
Web Data (Cooley et al 2001)
Domain knowledge in the form of site structure.
Given an itemset $F = \{X_1, X_2, \ldots, X_k\}$ ($X_i$: Web pages)
L: number of links connecting the pages
$\text{lfactor} = L / (k \times (k-1))$
cfactor = 1 (if graph is connected), 0 (disconnected graph)
Structure evidence = cfactor × lfactor
$\text{Usage evidence} = \dfrac{P(X_1 \cap X_2 \cap \ldots \cap X_k)}{P(X_1 \cup X_2 \cup \ldots \cup X_k)}$
Use Dempster-Shafer theory to combine domain knowledge and evidence from data.

Multiple Minimum Support
Is it sufficient to use only a single minimum support threshold?
In practice, each item has a different frequency.
If minimum support is set too high, we could miss rules involving the rare items.
If minimum support is set too low, it is computationally intensive and tends to produce too many rules.
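For the web-data structure evidence defined above, a minimal sketch is given below. Several choices are assumptions: links are directed (source, target) pairs, L counts only links between pages of the itemset, connectivity is tested on the undirected version of that subgraph, and the page names and the function name `structure_evidence` are hypothetical.

```python
def structure_evidence(pages, links):
    """cfactor * lfactor for a set of pages and a set of directed links."""
    k = len(pages)                      # needs k >= 2
    inside = {(a, b) for a, b in links
              if a in pages and b in pages and a != b}
    lfactor = len(inside) / (k * (k - 1))

    # Connectivity check by depth-first search on the undirected subgraph.
    neighbours = {p: set() for p in pages}
    for a, b in inside:
        neighbours[a].add(b)
        neighbours[b].add(a)
    start = next(iter(pages))
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    cfactor = 1 if seen == set(pages) else 0
    return cfactor * lfactor


pages = {"home", "docs", "faq"}
ring = {("home", "docs"), ("docs", "faq"), ("faq", "home")}
partial = {("home", "docs")}

print(structure_evidence(pages, ring))      # 3 links out of 6 possible
print(structure_evidence(pages, partial))   # "faq" is unreachable
```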

Multiple Minimum Support
How to apply multiple minimum supports?
MS(i): minimum support for item i
e.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
MS({Milk, Broccoli}) = min(MS(Milk), MS(Broccoli)) = 0.1%
Challenge: support is no longer anti-monotone.
Suppose: Support(Milk, Coke) = 1.5% and Support(Milk, Coke, Broccoli) = 0.5%
{Milk,Coke} is infrequent but {Milk,Coke,Broccoli} is frequent.

Multiple Minimum Support

Item  MS(I)   Sup(I)
A     0.10%   0.25%
B     0.20%   0.26%
C     0.30%   0.29%
D     0.50%   0.05%
E     3%      4.20%

[Figure: itemset lattice over A to E, from the single items through the 2-itemsets (AB, AC, AD, AE, BC, BD, BE, CD, CE, DE) to the 3-itemsets (ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE)]
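The loss of anti-monotonicity in the Milk, Coke, Broccoli example can be seen in a few lines. This is a sketch only: the function names are hypothetical and the support of {Milk, Coke} is taken to be 1.5%.

```python
# Per-item minimum supports, in percent.
min_support = {"Milk": 5.0, "Coke": 3.0, "Broccoli": 0.1, "Salmon": 0.5}


def itemset_min_support(itemset):
    """MS of an itemset is the smallest MS among its items."""
    return min(min_support[i] for i in itemset)


def is_frequent(itemset, support):
    return support >= itemset_min_support(itemset)


pair = is_frequent({"Milk", "Coke"}, 1.5)                 # threshold 3%
triple = is_frequent({"Milk", "Coke", "Broccoli"}, 0.5)   # threshold 0.1%
print(pair, triple)
```

The superset passes while its subset fails because adding Broccoli lowers the threshold faster than it lowers the support, so Apriori's usual subset-based pruning is no longer safe.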

Multiple Minimum Support
[Figure: the item table (Item, MS(I), Sup(I) for A to E) and the itemset lattice from the previous slide]

Multiple Minimum Support (Liu 1999)
Order the items according to their minimum support (in ascending order).
e.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
Ordering: Broccoli, Salmon, Coke, Milk
Need to modify Apriori such that:
L1: set of frequent items
F1: set of items whose support is ≥ MS(1), where MS(1) is min_i(MS(i))
C2: candidate itemsets of size 2 are generated from F1 instead of L1
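The difference between L1 and F1 can be illustrated with the five-item table (A to E). The percentages below are assumptions read from that table, in particular MS(A) = 0.10% and Sup(D) = 0.05%, and the variable names are hypothetical.

```python
from itertools import combinations

# item: (minimum support %, observed support %); values assumed from the table
items = {
    "A": (0.10, 0.25), "B": (0.20, 0.26), "C": (0.30, 0.29),
    "D": (0.50, 0.05), "E": (3.00, 4.20),
}

# Order items by ascending minimum support.
order = sorted(items, key=lambda i: items[i][0])
ms1 = items[order[0]][0]        # MS(1): the smallest minimum support

# L1: items that meet their own minimum support.
L1 = [i for i in order if items[i][1] >= items[i][0]]
# F1: items that meet the smallest minimum support MS(1).
F1 = [i for i in order if items[i][1] >= ms1]
# C2: candidate pairs are built from F1, not from L1.
C2 = list(combinations(F1, 2))

print(L1, F1, C2)
```

C is not frequent on its own, yet it stays in F1 because a pair such as {A, C} is judged against MS(A), which C's support exceeds.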

Multiple Minimum Support (Liu 1999)
Modifications to Apriori:
In traditional Apriori, a candidate (k+1)-itemset is generated by merging two frequent itemsets of size k, and the candidate is pruned if it contains any infrequent subsets of size k.
The pruning step has to be modified: prune only if the subset contains the first item.
e.g.: Candidate = {Broccoli, Coke, Milk} (ordered according to minimum support)
{Broccoli, Coke} and {Broccoli, Milk} are frequent but {Coke, Milk} is infrequent.
The candidate is not pruned because {Coke, Milk} does not contain the first item, i.e., Broccoli.

Other Types of Association Patterns
Maximal and Closed Itemsets
Negative Associations
Indirect Associations
Frequent Subgraphs
Cyclic Association Rules
Sequential Patterns
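The modified pruning step for multiple minimum supports can be contrasted with the traditional one on the Broccoli, Coke, Milk candidate. This is a sketch under the assumption that candidates are tuples already ordered by ascending minimum support; the function names are hypothetical.

```python
from itertools import combinations


def standard_prune(candidate, frequent):
    """Traditional Apriori: prune if any k-subset is infrequent."""
    k = len(candidate) - 1
    return any(frozenset(s) not in frequent
               for s in combinations(candidate, k))


def ms_prune(candidate, frequent):
    """Modified rule: only subsets containing the first item can trigger pruning."""
    first = candidate[0]
    k = len(candidate) - 1
    return any(first in s and frozenset(s) not in frequent
               for s in combinations(candidate, k))


candidate = ("Broccoli", "Coke", "Milk")    # ordered by minimum support
frequent = {frozenset({"Broccoli", "Coke"}), frozenset({"Broccoli", "Milk"})}

print(standard_prune(candidate, frequent))  # {Coke, Milk} is infrequent
print(ms_prune(candidate, frequent))        # {Coke, Milk} is ignored
```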

Maximal Itemset
[Figure: itemset lattice over A to E, from null to ABCDE, with a border separating the frequent itemsets from the infrequent itemsets; the maximal itemsets are the frequent itemsets lying directly on the border]

Closed Itemset
An itemset X is closed if there exists no itemset X' such that:
X' is a superset of X
All transactions that contain X also contain X'
[Table: transactions over three groups of items, A1 to A10, B1 to B10, C1 to C10]
Number of frequent itemsets $= 3 \times \sum_{k=1}^{10} \binom{10}{k}$
Number of closed itemsets = 3: {A1,…,A10}, {B1,…,B10}, {C1,…,C10}
Number of maximal itemsets = 3
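Assuming each of the three groups contains 10 items (the group size is inferred, since the subscripts are not legible in this copy), the count of frequent itemsets works out as follows: every non-empty subset of a group is frequent, and a set of 10 items has $2^{10} - 1$ non-empty subsets.

```latex
\[
3 \times \sum_{k=1}^{10} \binom{10}{k}
  \;=\; 3 \times \left(2^{10} - 1\right)
  \;=\; 3 \times 1023
  \;=\; 3069
\]
```

Three closed (or maximal) itemsets therefore summarise 3069 frequent itemsets, which is the motivation for these compact representations.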

Maximal vs. Closed Itemset

TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

[Figure: itemset lattice over A to E, each node annotated with the ids of the transactions that contain it; ABCDE is not supported by any transaction]

Maximal vs. Closed Frequent Itemsets
Minimum support = 2
[Figure: the same lattice, marking itemsets that are closed but not maximal and itemsets that are both closed and maximal]
# Closed = 9
# Maximal = 4
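The counts of 9 closed and 4 maximal frequent itemsets can be verified by brute force over the five transactions. This is an exhaustive check suitable only for tiny examples, not a mining algorithm, and the helper names are hypothetical.

```python
from itertools import combinations

transactions = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]
items = sorted(set().union(*transactions))
min_support = 2


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


# Enumerate all frequent itemsets with their supports.
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(set(combo))
        if s >= min_support:
            frequent[frozenset(combo)] = s


def immediate_supersets(itemset):
    return [itemset | {i} for i in items if i not in itemset]


# Closed: no immediate superset has the same support.
closed = [x for x, s in frequent.items()
          if all(support(y) < s for y in immediate_supersets(x))]
# Maximal: no immediate superset is frequent.
maximal = [x for x in frequent
           if all(y not in frequent for y in immediate_supersets(x))]

print(len(frequent), len(closed), len(maximal))
print(sorted("".join(sorted(x)) for x in maximal))
```

Every maximal itemset is also closed, so the maximal itemsets are a subset of the closed ones.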
