1 Frequent Pattern Mining


Decision Support Systems
MEIC - Alameda 2010/2011
Homework #5
Due date: 31.Oct

Frequent Pattern Mining

1. The Apriori algorithm uses prior knowledge about subset support properties. In particular, it relies heavily on the Apriori property in the process of candidate generation and elimination. In this question, you are asked to establish some facts that allow the application of the Apriori property in the Apriori algorithm.

(a) (1/2 val.) Show that the (relative) support of any nonempty itemset $S' \subseteq S$, where $S$ is some given itemset, must be at least as great as the support of $S$.

Let $D$ be the set of task-relevant data, i.e., a database containing the set of all transactions to be analyzed. Let $S$ be a frequent itemset, and let $\mathrm{sup}_c(S, D)$ and $\mathrm{sup}_\%(S, D)$ be the absolute and relative supports of $S$ in $D$, respectively. Finally, let $S'$ be any nonempty subset of $S$. Then, by definition,

$$\mathrm{sup}_c(S, D) = \mathrm{sup}_\%(S, D)\,|D|, \qquad \mathrm{sup}_c(S', D) = \mathrm{sup}_\%(S', D)\,|D|,$$

where $|D|$ is the number of transactions in $D$. Any transaction containing itemset $S$ will also contain itemset $S'$. Therefore, $\mathrm{sup}_c(S', D) \ge \mathrm{sup}_c(S, D)$ and

$$\mathrm{sup}_\%(S', D)\,|D| \ge \mathrm{sup}_\%(S, D)\,|D|.$$

Dividing both sides by $|D| > 0$, the conclusion follows.

(b) (1/2 val.) Show that all nonempty subsets of a frequent itemset must also be frequent.

If minsup denotes the minimum (relative) support, an itemset $S$ is a frequent itemset if $\mathrm{sup}_\%(S, D) \ge \mathrm{minsup}$ or, equivalently, if $\mathrm{sup}_c(S, D) \ge \mathrm{minsup}\,|D|$, where $|D|$ is the number of transactions in $D$.

Let $S'$ be any nonempty subset of $S$. As seen in Part (a),

$$\mathrm{sup}_c(S', D) \ge \mathrm{sup}_c(S, D) \ge \mathrm{minsup}\,|D|.$$

Thus, $S'$ is also a frequent itemset.

(c) (1 val.) Given a frequent itemset $L$ and subset $S \subset L$, show that the confidence of the rule $S' \Rightarrow (L - S')$ cannot be larger than the confidence of the rule $S \Rightarrow (L - S)$, for any $S' \subseteq S$.

We have

$$\mathrm{conf}\big(S \Rightarrow (L - S)\big) = \frac{\mathrm{sup}_c(L)}{\mathrm{sup}_c(S)}.$$

If now $S'$ is some nonempty subset of $S$,

$$\mathrm{conf}\big(S' \Rightarrow (L - S')\big) = \frac{\mathrm{sup}_c(L)}{\mathrm{sup}_c(S')}.$$

As seen in Part (a), $\mathrm{sup}_c(S') \ge \mathrm{sup}_c(S)$, implying that

$$\mathrm{conf}\big(S' \Rightarrow (L - S')\big) \le \mathrm{conf}\big(S \Rightarrow (L - S)\big).$$

(d) (1 val.) A partitioning variation of Apriori subdivides the transactions of a database $D$ into $N$ nonoverlapping partitions. Show that any itemset that is frequent in $D$ must be frequent in at least one partition of $D$.

We want to show that any itemset $S$ that is frequent in $D$ must be frequent in at least one partition of $D$. We establish this by contradiction. Let $D_1, \ldots, D_N$ be a partition of $D$, i.e., $D = \bigcup_{k=1}^{N} D_k$ with the $D_k$ pairwise disjoint. Let us assume that there is an itemset $S$ that is frequent in $D$ but not frequent in any of the partitions of $D$. The fact that $S$ is frequent means that $\mathrm{sup}_\%(S, D) \ge \mathrm{minsup}$, where minsup is the minimum support. Equivalently, it means that

$$\mathrm{sup}_c(S, D) \ge \mathrm{minsup}\,|D|.$$

Let $c_1, \ldots, c_N$ denote the number of transactions containing itemset $S$ in $D_1, \ldots, D_N$, respectively. We have the following identities:

$$|D| = |D_1| + |D_2| + \cdots + |D_N| \qquad \text{and} \qquad \mathrm{sup}_c(S, D) = c_1 + c_2 + \cdots + c_N.$$

By assumption, $S$ is not frequent in any of the partitions $D_1, \ldots, D_N$, meaning that

$$c_1 < \mathrm{minsup}\,|D_1|, \quad \ldots, \quad c_N < \mathrm{minsup}\,|D_N|.$$

Adding up the previous expressions leads to the conclusion that

$$c_1 + \cdots + c_N < \mathrm{minsup}\,\big(|D_1| + \cdots + |D_N|\big)$$

or, equivalently,

$$\mathrm{sup}_c(S, D) < \mathrm{minsup}\,|D|,$$

which contradicts our assumption that $S$ is a frequent itemset. The conclusion follows.
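As a sanity check on Parts (b) and (d), the following minimal Python sketch brute-forces both properties on a small toy database. The database, the partitioning, the threshold, and all function names are illustrative assumptions and are not part of the homework.

```python
from itertools import combinations


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def is_frequent(itemset, transactions, minsup):
    """Frequency test in absolute form: sup_c >= minsup * |D|."""
    return support_count(itemset, transactions) >= minsup * len(transactions)


def check_properties(transactions, partitions, minsup):
    """Assert Parts (b) and (d) for every frequent itemset; return those itemsets."""
    items = sorted(set().union(*transactions))
    frequent = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            if not is_frequent(s, transactions, minsup):
                continue
            # Part (b): every nonempty subset of a frequent itemset is frequent.
            assert all(
                is_frequent(frozenset(sub), transactions, minsup)
                for r in range(1, k + 1)
                for sub in combinations(combo, r)
            )
            # Part (d): frequent in D implies frequent in at least one partition.
            assert any(is_frequent(s, p, minsup) for p in partitions)
            frequent.append(s)
    return frequent


# Hypothetical toy database, split into three nonoverlapping partitions.
D = [frozenset(t) for t in ("abc", "ab", "ac", "b", "abd", "cd")]
partitions = [D[:2], D[2:4], D[4:]]
frequent = check_properties(D, partitions, minsup=0.5)
print(sorted("".join(sorted(s)) for s in frequent))
```

A passing run does not prove anything; it only shows that the two statements hold on this particular example.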

2. Consider the database $D$ depicted in Table 1, containing five transactions. Consider minsup = 60% and minconf = 80%.

Table 1: Database D of transactions to be analyzed.

TID     Items
T100    {M, O, N, K, E, Y}
T200    {D, O, N, K, E, Y}
T300    {M, A, K, E}
T400    {M, U, C, K, Y}
T500    {C, O, O, K, I, E}

(a) (2 val.) Using the Apriori algorithm, find all frequent itemsets in the database $D$.

Using Apriori, we successively generate the sets $C_k$ of candidate $k$-itemsets, and then verify these for minsup, obtaining the sets $L_k$ of frequent $k$-itemsets. With five transactions, minsup = 60% corresponds to a support count of at least 3. In the database $D$, this leads to (support counts follow each itemset):

C_1 = {A: 1, C: 2, D: 1, E: 4, I: 1, K: 5, M: 3, N: 2, O: 3, U: 1, Y: 3}
L_1 = {E: 4, K: 5, M: 3, O: 3, Y: 3}
C_2 = {EK: 4, EM: 2, EO: 3, EY: 2, KM: 3, KO: 3, KY: 3, MO: 1, MY: 2, OY: 2}
L_2 = {EK: 4, EO: 3, KM: 3, KO: 3, KY: 3}
C_3 = L_3 = {EKO: 3}

(b) (2 val.) Using the FP-growth algorithm, find all frequent itemsets in the database $D$.

The FP-growth algorithm starts by building the set $L_1$ using a similar approach to the Apriori algorithm. The 1-itemsets are then sorted in decreasing order of support as K, E, M, O, Y and used to build the FP-tree:

Root
    K : 5
        E : 4
            M : 2
                O : 1
                    Y : 1
            O : 2
                Y : 1
        M : 1
            Y : 1

from where the following conditional pattern base is constructed:

Item   Cond. Pattern Base                 Cond. Tree     Frequent Patterns
Y      {{KEMO}: 1, {KEO}: 1, {KM}: 1}     K: 3           {KY}: 3
O      {{KEM}: 1, {KE}: 2}                K: 3, E: 3     {KO}: 3, {EO}: 3, {KEO}: 3
M      {{KE}: 2, {K}: 1}                  K: 3           {KM}: 3
E      {{K}: 4}                           K: 4           {KE}: 4

(c) (1 val.) Compare the two methods in terms of mining efficiency.

Apriori has to perform multiple scans of the database (one per candidate level), whereas FP-growth needs only two: one to find the frequent 1-itemsets and one to build the FP-tree. Candidate generation in Apriori is expensive (owing to the self-join), while FP-growth does not generate any candidates.

(d) (2 val.) List all of the strong association rules and the corresponding support $S$ and confidence $C$ matching the following metarule, where $X$ is a variable representing customers, and $\text{item}_i$ denotes variables representing items (e.g., A, O, etc.):

$$\forall t \in D, \quad \text{buys}(X, \text{item}_1) \wedge \text{buys}(X, \text{item}_2) \Rightarrow \text{buys}(X, \text{item}_3) \quad [S, C].$$

The following associations verify the above condition:

{K, O} $\Rightarrow$ {E}   [0.6, 1]
{E, O} $\Rightarrow$ {K}   [0.6, 1]

1.1 Practical Questions (Using SQL Server 2008)

3. Run the same association mining algorithm you experimented with in the lab, but now with a minimum support minsup = 5% and minimum confidence minconf = 50%.
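Returning to Question 2, the results of Parts (a) and (d) can be cross-checked with a short Apriori sketch in Python. This is a minimal illustration under stated assumptions, not the lab tool used in the practical questions: the function names `apriori` and `strong_rules` are hypothetical, transactions are modeled as sets (so the repeated O in T500 counts once), and the join step is a simple pairwise union rather than an optimized prefix join.

```python
from itertools import combinations

# Table 1, one set of items per transaction (duplicates collapse in a set).
DB = {
    "T100": set("MONKEY"),
    "T200": set("DONKEY"),
    "T300": set("MAKE"),
    "T400": set("MUCKY"),
    "T500": set("COOKIE"),
}


def apriori(transactions, min_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    def count(cands):
        return {c: sum(1 for t in transactions if c <= t) for c in cands}

    items = {frozenset([i]) for t in transactions for i in t}
    level = {c: n for c, n in count(items).items() if n >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        cands = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent.
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = {c: n for c, n in count(cands).items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent


def strong_rules(frequent, n_transactions, minconf, lhs_size, rhs_size):
    """Rules A => B with |A| = lhs_size, |B| = rhs_size and conf >= minconf."""
    rules = []
    for itemset, cnt in frequent.items():
        if len(itemset) != lhs_size + rhs_size:
            continue
        for lhs in combinations(sorted(itemset), lhs_size):
            lhs = frozenset(lhs)
            conf = cnt / frequent[lhs]  # the antecedent is frequent by Part 1(b)
            if conf >= minconf:
                rules.append((lhs, itemset - lhs, cnt / n_transactions, conf))
    return rules


transactions = list(DB.values())
freq = apriori(transactions, min_count=3)  # 60% of 5 transactions
rules = strong_rules(freq, len(transactions), 0.8, lhs_size=2, rhs_size=1)
for lhs, rhs, sup, conf in rules:
    print(sorted(lhs), "=>", sorted(rhs), [sup, conf])
```

On Table 1 this yields the same eleven frequent itemsets as listed in $L_1$, $L_2$ and $L_3$, and the same two rules matching the metarule; {E, K} $\Rightarrow$ {O} is rejected because its confidence is 3/4, below minconf.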

(a) (2 val.) Indicate the set $L_1$ of frequent 1-itemsets and the set $L_2$ of frequent 2-itemsets discovered by the algorithm.

The sets $L_1$ and $L_2$ of frequent itemsets are:

L_1:
Item                      Support count
Sport-100                 6,171
Water bottle              4,076
Patch kit                 3,010
Mountain tire tube        2,908
Mountain-200              2,477
Road tire tube            2,216
Cycling cap               2,095
Fender set (Mountain)     2,014
Mountain bottle cage      1,941
Road bottle cage          1,702
Long sleeve logo          1,642
Short sleeve classic      1,537
Road-750                  1,443
Touring tire tube         1,397
Half-finger gloves        1,363
HL mountain tire          1,331
Touring                   ,255
ML mountain tire          1,083

L_2:
Itemset                                  Support count
{Mountain bottle cage, Water bottle}     1,623
{Road bottle cage, Water bottle}         1,513
{Mountain tire tube, Sport-100}          1,240

(b) (3 val.) From the above itemsets, determine by yourself (not resorting to the algorithm) the possible associations obtained from the itemsets in Part (a). Indicate the confidence associated with each such association rule and all relevant calculations.

The possible associations can be computed by analyzing the 2-itemsets above. We get:

Rule Water bottle $\Rightarrow$ Mountain bottle cage: This rule has a support count of $\mathrm{sup}_c = 1{,}623$ and a confidence of conf = 1,623/4,076 ≈ 0.398. It is not a strong rule.

Rule Mountain bottle cage $\Rightarrow$ Water bottle: This rule has a support count of $\mathrm{sup}_c = 1{,}623$

and a confidence of conf = 1,623/1,941 ≈ 0.836. It is a strong rule.

Rule Water bottle $\Rightarrow$ Road bottle cage: This rule has a support count of $\mathrm{sup}_c = 1{,}513$ and a confidence of conf = 1,513/4,076 ≈ 0.371. It is not a strong rule.

Rule Road bottle cage $\Rightarrow$ Water bottle: This rule has a support count of $\mathrm{sup}_c = 1{,}513$ and a confidence of conf = 1,513/1,702 ≈ 0.889. It is a strong rule.

Rule Sport-100 $\Rightarrow$ Mountain tire tube: This rule has a support count of $\mathrm{sup}_c = 1{,}240$ and a confidence of conf = 1,240/6,171 ≈ 0.201. It is not a strong rule.

Rule Mountain tire tube $\Rightarrow$ Sport-100: This rule has a support count of $\mathrm{sup}_c = 1{,}240$ and a confidence of conf = 1,240/2,908 ≈ 0.426. It is not a strong rule.

(c) (1 val.) Indicate which of the associations from Part (b) are strong associations. Justify your selection and confirm your results with those obtained by the Microsoft Association algorithm.

Only associations which satisfy both minimum support and minimum confidence are strong associations. As already seen in the previous question, the only strong associations are:

Road bottle cage $\Rightarrow$ Water bottle       [sup = 30.3%, C = 88.9%]
Mountain bottle cage $\Rightarrow$ Water bottle   [sup = 32.5%, C = 83.6%]

This corresponds to the result obtained by the Microsoft Association algorithm.

(d) (2 val.) Indicate the dependence network computed by the algorithm and explain its meaning.

The dependence network can be computed from the above association rules, yielding:

Mountain bottle cage → Water bottle ← Road bottle cage

This indicates that the presence of either of the items Road bottle cage or Mountain bottle cage is a strong indicator of the presence of the item Water bottle.

4. (2 val.) Note that, besides the confidence associated with each association rule, MS SQL Server also indicates the importance of the rule. Importance determines how useful a given rule is. For example,

if an item $A$ appears in every transaction, it may give rise to many rules $X \Rightarrow A$, although these rules are not very informative. Such a rule would have a low importance. In the data-mining literature, a quantity providing similar information is the lift, computed as

$$\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{sup}_\%(X \cup Y)}{\mathrm{sup}_\%(X)\,\mathrm{sup}_\%(Y)}.$$

Compute the lift for the association rules mined. For this purpose, take into consideration that the database includes a total of 5,000 transactions. Indicate your calculations and confirm that the rules with larger lift are also ranked by the Microsoft Association algorithm as more important.

Computing the lift for the previously mined rules (RBC = Road bottle cage, MBC = Mountain bottle cage, WB = Water bottle), we get:

$$\mathrm{lift}(\mathrm{RBC} \Rightarrow \mathrm{WB}) = \frac{1{,}513 \times 5{,}000}{1{,}702 \times 4{,}076} \approx 1.09,$$

$$\mathrm{lift}(\mathrm{MBC} \Rightarrow \mathrm{WB}) = \frac{1{,}623 \times 5{,}000}{1{,}941 \times 4{,}076} \approx 1.03,$$

which agrees with the importance results from the Microsoft Association algorithm depicted in the previous question.
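The confidence and lift calculations of Questions 3 and 4 can be reproduced with a few lines of Python. This is a minimal sketch: the support counts and the total of 5,000 transactions are copied from the text above, while the dictionary layout and the function names `confidence` and `lift` are illustrative assumptions.

```python
N = 5_000  # total number of transactions, as stated in Question 4

# Support counts taken from L_1 and L_2 in Question 3(a).
item_count = {
    "Water bottle": 4076,
    "Mountain bottle cage": 1941,
    "Road bottle cage": 1702,
}
pair_count = {
    frozenset({"Mountain bottle cage", "Water bottle"}): 1623,
    frozenset({"Road bottle cage", "Water bottle"}): 1513,
}


def confidence(lhs, rhs):
    """conf(lhs => rhs) = sup_c(lhs and rhs together) / sup_c(lhs)."""
    return pair_count[frozenset({lhs, rhs})] / item_count[lhs]


def lift(lhs, rhs, n=N):
    """lift(lhs => rhs) = sup_%(lhs, rhs) / (sup_%(lhs) * sup_%(rhs))."""
    sup_xy = pair_count[frozenset({lhs, rhs})] / n
    return sup_xy / ((item_count[lhs] / n) * (item_count[rhs] / n))


for lhs in ("Road bottle cage", "Mountain bottle cage"):
    c = confidence(lhs, "Water bottle")
    l = lift(lhs, "Water bottle")
    print(f"{lhs} => Water bottle: conf = {c:.3f}, lift = {l:.3f}")
```

Because the lift is symmetric in its two arguments, `lift("Water bottle", "Road bottle cage")` gives the same value as the reverse rule; only the confidence distinguishes the two directions.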
