10/19/2017 MIST.6060 Business Intelligence and Data Mining 1. Association Rules


Alvin Richardson

Association Rules

Examples of Association Rules

Sixty percent of customers who buy sheets and pillowcases order a comforter next, followed by drapes.

There is a legendary "diapers and beer" story that you can easily find from a number of sources on the Internet. The following is one version of it:

A number of convenience store clerks noticed that men often bought beer at the same time they bought diapers. The store mined its receipts and proved the clerks' observations correct. So, the store began stocking diapers next to the beer coolers, and sales skyrocketed. The explanation goes that when fathers are sent out on an errand to buy diapers, they often purchase a six-pack of their favorite beer as a reward.

Terminology and Examples

An itemset is the set of items that appear together in a (transaction) record. For example, diapers and beer are items individually. Together, {diapers, beer} is an itemset. An itemset containing k items is called a k-itemset. For example, {diapers, beer} is a 2-itemset. A 1-itemset is just a single item.

The diapers and beer association can be expressed as a rule: "if young men buy diapers, then they buy beer too" (or vice versa). In association rules, the "if" part is called the antecedent, and the "then" part is the consequent. The rule is generally written as:

antecedent → consequent

So, the diapers and beer rule is expressed as:

diapers → beer

Measures

Let X and Y be two disjoint itemsets (where disjoint means X and Y have no common items; also note that X and Y are itemsets and thus can each contain multiple items). The following measures are defined in terms of the rule X → Y.

The support of the rule X → Y, or equivalently, the support of the itemsets X and Y taken together, is the percentage (or number) of transactions containing both X and Y:

$$\mathrm{support}(X \rightarrow Y) = \mathrm{support}(X \cup Y) = \frac{\text{number of transactions containing both } X \text{ and } Y}{\text{total number of transactions}},$$

which represents the probability $P(X \cap Y)$ that a transaction contains both X and Y.
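As a minimal sketch of the support definition above, assuming each transaction is stored as a Python set of item names. The function name `support` and the four baskets are made up for illustration; they are not data from the diapers and beer story.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# Hypothetical baskets, invented only to exercise the function.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
]

# Two of the four baskets contain both diapers and beer.
print(support(baskets, {"diapers", "beer"}))  # 0.5
```

Returning a fraction rather than a raw count keeps the value comparable across datasets of different sizes; multiplying by the number of transactions recovers the support count.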

An itemset is called a frequent itemset (or large itemset) if its support is greater than or equal to a prespecified minimum value (minimum support).

The confidence of the rule X → Y is the percentage of transactions containing X that also contain Y:

$$\mathrm{confidence}(X \rightarrow Y) = \frac{\text{number of transactions containing both } X \text{ and } Y}{\text{number of transactions containing } X},$$

which represents the conditional probability $P(Y \mid X) = P(X \cap Y) / P(X)$.

The lift of the rule X → Y is the ratio of the confidence to the confidence that would be expected if the consequent Y were independent of the antecedent X (i.e., if knowing X did not help in predicting Y):

$$\mathrm{lift}(X \rightarrow Y) = \frac{\mathrm{confidence}(X \rightarrow Y)}{\mathrm{support}(Y)},$$

which represents the probability ratio $P(X \cap Y) / \big(P(X)\,P(Y)\big)$.

Clearly, it is desirable to have higher values for support, confidence, and/or lift.

An Illustrative Example (Book Transaction)

The following is a set of four transactions related to computer books sold in an online bookstore:

Transaction ID    Items
101               Database (D), Web (W), Java (J)
102               Web (W), MS-Office (M), Net (N)
103               Web (W), Java (J), MS-Office (M), Net (N)
104               MS-Office (M), Net (N)

The transaction dataset includes 5 items: D, J, M, N, and W. As an example, let X = {M, N} and Y = {W}. Then, X → Y means "if customers buy MS-Office and Net books, then they buy Web books too". Note that the dataset is not in a structured table format with clearly defined attributes/columns; it is in a semi-structured format.

We have:

$$\mathrm{support}(X \rightarrow Y) = \frac{\text{\# transactions containing M, N and W}}{\text{total \# transactions}} = \frac{2}{4} = 50\%,$$

$$\mathrm{confidence}(X \rightarrow Y) = \frac{\text{\# transactions containing M, N and W}}{\text{\# transactions containing M and N}} = \frac{2}{3} = 66.67\%,$$

$$\mathrm{lift}(X \rightarrow Y) = \frac{\mathrm{confidence}(X \rightarrow Y)}{\mathrm{support}(Y)} = \frac{2/3}{3/4} = \frac{8}{9}.$$

Apriori Algorithm for Finding Frequent Itemsets

An exhaustive search for all frequent itemsets in a dataset that contains m items can potentially generate up to $2^m - 1$ itemsets, which is computationally very expensive. To reduce computational cost, the Apriori algorithm uses the following principle:

If an itemset is frequent, then all of its subsets must also be frequent.

To find the frequent k-itemsets, the Apriori algorithm runs up to k iterations. Each iteration includes two steps:

Step 1: Find all possible candidate itemsets. Based on the Apriori principle, for each iteration after the first, we only need to consider those itemsets that have survived the previous iteration.

Step 2: Compute each candidate's support and select those itemsets whose supports are greater than or equal to the prespecified minimum support value. The itemsets not selected are eliminated from further consideration.

Apriori Algorithm for the Book Transaction Example

We illustrate how the Apriori algorithm works for the book transaction example. Let minimum support = 50%. That is, a frequent itemset must have support count ≥ 2.

Iteration 1: Find all frequent 1-itemsets.

Candidate 1-Itemset    Count    Frequent 1-Itemset    Count
{D}                    1
{J}                    2        {J}                   2
{M}                    3        {M}                   3
{N}                    3        {N}                   3
{W}                    3        {W}                   3
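Iteration 1 amounts to counting how often each single item occurs. The following is a small sketch of that count for the four book transactions, assuming they are held in a dictionary keyed by transaction ID; the variable names are our own.

```python
from collections import Counter

# The four book transactions, keyed by transaction ID.
transactions = {
    101: {"D", "W", "J"},
    102: {"W", "M", "N"},
    103: {"W", "J", "M", "N"},
    104: {"M", "N"},
}
min_count = 2  # 50% of 4 transactions

# Count every item once per transaction in which it appears.
item_counts = Counter(item for items in transactions.values() for item in items)

# Keep only the items that meet the minimum support count.
frequent_1 = {item: n for item, n in item_counts.items() if n >= min_count}

print(sorted(frequent_1.items()))
# [('J', 2), ('M', 3), ('N', 3), ('W', 3)]
```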

Iteration 2: Find all frequent 2-itemsets.

Since {D} is not a frequent 1-itemset, based on the Apriori principle, we can eliminate {D, J}, {D, M}, {D, N} and {D, W} in searching for candidate frequent 2-itemsets. That is, we consider all possible 2-item combinations with components J, M, N and W. Thus, by using the Apriori principle the number of candidates is reduced from $\binom{5}{2} = 10$ to $\binom{4}{2} = 6$.

Candidate 2-Itemset    Count    Frequent 2-Itemset    Count
{J, M}                 1
{J, N}                 1
{J, W}                 2        {J, W}                2
{M, N}                 3        {M, N}                3
{M, W}                 2        {M, W}                2
{N, W}                 2        {N, W}                2

Iteration 3: Find all frequent 3-itemsets.

Since {J, M} and {J, N} are not frequent 2-itemsets, we can eliminate any 3-itemset that contains either of them (i.e., {J, M, N}, {J, M, W} and {J, N, W}). As a result, all candidate 3-itemsets can only have subsets {M, N}, {M, W} or {N, W}. There is only one such candidate, {M, N, W}, which turns out to be frequent (support count 2). Again, using the Apriori principle, the number of candidates is reduced from $\binom{5}{3} = 10$ to one.

Candidate 3-Itemset    Count    Frequent 3-Itemset    Count
{M, N, W}              2        {M, N, W}             2

Iteration 4: There is only one 4-itemset that contains {M, N, W} and no eliminated item, namely {J, M, N, W}. Its support count is 1, which is less than the minimum support count of 2. So, there is no frequent 4-itemset. The search stops.

Generating Association Rules from Frequent Itemsets

Generating association rules from frequent itemsets involves computing confidence. This is easy because it is not necessary to scan the dataset again to get the counts. Any subset of a frequent itemset must also be in the list of frequent itemsets, whose support counts have already been calculated during the Apriori procedure. The confidence of a rule is therefore simply the ratio of the support of the itemset to the support of the subset that forms the antecedent. The association rules with a confidence value greater than (or equal to) the minimum confidence value will be selected.
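The iterations and the rule-generation step described above can be sketched in a few lines of Python. This is an illustrative sketch under our own assumptions, not a reference implementation: the function names `apriori` and `rules_from` are hypothetical, and candidates are built by brute-force combination of the surviving items followed by a subset check. One difference from the hand calculation is that {J, M, N, W} is pruned before it is ever counted, because its subset {J, M} is not frequent.

```python
from itertools import combinations

transactions = [
    {"D", "W", "J"},       # 101
    {"W", "M", "N"},       # 102
    {"W", "J", "M", "N"},  # 103
    {"M", "N"},            # 104
]


def apriori(transactions, min_count):
    """Return {frozenset: support count} for every frequent itemset."""
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # Step 2: count each candidate and keep those meeting the minimum.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        # Step 1 of the next iteration: form (k+1)-itemsets from surviving
        # items and keep only those whose k-item subsets all survived.
        k += 1
        pool = sorted({i for c in survivors for i in c})
        candidates = [
            frozenset(combo)
            for combo in combinations(pool, k)
            if all(frozenset(s) in survivors for s in combinations(combo, k - 1))
        ]
    return frequent


def rules_from(itemset, frequent):
    """Confidence of each rule antecedent -> rest, using stored counts only."""
    itemset = frozenset(itemset)
    confidences = {}
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            antecedent = frozenset(antecedent)
            consequent = itemset - antecedent
            confidences[(antecedent, consequent)] = (
                frequent[itemset] / frequent[antecedent]
            )
    return confidences


frequent = apriori(transactions, min_count=2)
rules = rules_from({"M", "N", "W"}, frequent)
for (x, y), conf in rules.items():
    print(sorted(x), "->", sorted(y), f"{conf:.2%}")
```

Note that `rules_from` never touches `transactions`: every count it needs is already in the `frequent` dictionary, which is the point made in the paragraph above.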

For the book example, consider rules that can be generated from the frequent itemset {M, N, W}. We have the following association rules and corresponding confidence values:

1. {M, N} → {W}, $\text{confidence} = \frac{\text{support count of } \{M, N, W\}}{\text{support count of } \{M, N\}} = \frac{2}{3} = 66.67\%$
2. {M, W} → {N}, $\text{confidence} = \frac{\text{support count of } \{M, N, W\}}{\text{support count of } \{M, W\}} = \frac{2}{2} = 100\%$
3. {N, W} → {M}, $\text{confidence} = \frac{\text{support count of } \{M, N, W\}}{\text{support count of } \{N, W\}} = \frac{2}{2} = 100\%$
4. {M} → {N, W}, $\text{confidence} = \frac{\text{support count of } \{M, N, W\}}{\text{support count of } \{M\}} = \frac{2}{3} = 66.67\%$
5. {N} → {M, W}, $\text{confidence} = \frac{\text{support count of } \{M, N, W\}}{\text{support count of } \{N\}} = \frac{2}{3} = 66.67\%$
6. {W} → {M, N}, $\text{confidence} = \frac{\text{support count of } \{M, N, W\}}{\text{support count of } \{W\}} = \frac{2}{3} = 66.67\%$

So, if the cutoff confidence value is 60%, all rules will be selected; if the cutoff is 70%, only rules 2 and 3 will be selected.

Note that an association rule does not necessarily imply causality. It only suggests a strong co-occurrence relationship between items in the antecedent and consequent of the rule.

Association Rules in Weka

The data rows of the book.arff file are shown below. Note that if a transaction does not contain an item, the corresponding item value is set as "?". You can also use, say, "n" for the same purpose (and then specify {y,n} for each attribute in the header), but then the output results will be slightly more difficult to understand.

Attributes (in order): Database, Java, MSOffice, Net, Web

    y,y,?,?,y
    ?,?,y,y,y
    ?,y,y,y,y
    ?,?,y,y,?
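To see how the table format relates to the original transactions, the data rows can be mapped back to itemsets: every attribute whose value is y belongs to the basket, and every ? is skipped. The sketch below assumes the attribute order Database, Java, MSOffice, Net, Web; the helper name `row_to_items` is hypothetical, and the rows are typed in as strings rather than parsed from the file.

```python
attributes = ["Database", "Java", "MSOffice", "Net", "Web"]

# Data rows of the table-format file, one string per transaction.
rows = [
    "y,y,?,?,y",
    "?,?,y,y,y",
    "?,y,y,y,y",
    "?,?,y,y,?",
]


def row_to_items(row, attributes):
    """Return the set of attribute names whose value in the row is 'y'."""
    values = row.split(",")
    return {name for name, value in zip(attributes, values) if value == "y"}


baskets = [row_to_items(row, attributes) for row in rows]
for basket in baskets:
    print(sorted(basket))
```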


1. Click Explorer. Click Open file, then find and open the book.arff file.

2. Click Associate (and then Choose / associations / Apriori, if it is not the default one).

3. Click the long horizontal box on the right side of the Choose button. A pop-up weka.gui.GenericObjectEditor window will appear.
   - In the lowerBoundMinSupport box, enter 0.5 (i.e., 50%).
   - In the metricType box, select Lift. Typically, we use the default Confidence metric, but it shows confidence only, without lift values, in the output.
   - In the minMetric box, enter 0.8 (for the cutoff lift value).
   - In the numRules (number of rules) box, enter 14 (the default value of 10 will display only 10 rules, not all 14 rules).
   - In the outputItemSets box, select True (the default is False because this part of the output can be very large for a large dataset).
   - Keep the default values for all the other parameters. Click OK.


4. Click Start to get the results, which are consistent with our hand calculation. Note that a frequent itemset is called a large itemset in Weka (both terms are widely used in practice). If, in the previous step, we specify a Confidence cutoff value of 0.6 (instead of a Lift cutoff value of 0.8), we will have the same results but less detailed output information.


Association Rules in Rattle (Data in Table Format)

1. Click Data. Click the ARFF radio button for an ARFF file. In the Filename box, find and open the book.arff file (or click the File button and open the book.csv file). Click Execute. Deselect Partition. Select Input for all attributes. Click Execute again.

2. Click Associate. In the Support box, specify the minimum support. In the Confidence box, specify the minimum confidence. In the Sorted by box, specify Lift. Click Execute.


3. Click the Show Rules button to get rules. Note that Rattle excludes rules with multiple consequents, resulting in a total of 11 rules (compared to 14 rules by Weka).

Association Rules in Rattle (Data in Transaction Format)

The book transaction data are stored in the booktran.csv file as follows:

    tranid, Item
    1, Database
    1, Web
    1, Java
    2, Web
    2, MSOffice
    2, Net
    3, Web
    3, Java
    3, MSOffice
    3, Net
    4, MSOffice
    4, Net
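In transaction format, each row carries one (transaction, item) pair, so recovering the baskets is a matter of grouping rows by tranid. The following sketch does this with Python's standard csv module. It assumes the column names tranid and Item shown above, and it reads from an in-memory string that stands in for the file on disk.

```python
import csv
import io
from collections import defaultdict

# Same rows as the transaction-format listing; io.StringIO stands in for
# opening the file itself.
data = """tranid, Item
1, Database
1, Web
1, Java
2, Web
2, MSOffice
2, Net
3, Web
3, Java
3, MSOffice
3, Net
4, MSOffice
4, Net
"""

baskets = defaultdict(set)
# skipinitialspace drops the blank that follows each comma.
for row in csv.DictReader(io.StringIO(data), skipinitialspace=True):
    baskets[int(row["tranid"])].add(row["Item"])

for tranid, items in sorted(baskets.items()):
    print(tranid, sorted(items))
```

Grouping by the identifier column plays the same role as choosing Ident for tranid and Target for Item in the steps that follow.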

1. Click Data. Click the File radio button. In the Filename box, find and open the booktran.csv file. Deselect Partition. Click Execute. Select Ident for the tranid attribute and select Target for the Item attribute. Click Execute again.

2. Click Associate. Select the Baskets check box. In the Support box, specify 0.5000. In the Confidence box, specify 0.6000. In the Sorted by box, specify Lift. Click Execute. Then click Show Rules. You will see the same results as those based on the data in table format (book.arff or book.csv file).
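The difference between the 14 rules and the 11 single-consequent rules can be checked by hand with a brute-force enumeration. The sketch below is only a verification aid under our own assumptions (minimum support 0.5, minimum confidence 0.6). It enumerates every itemset directly instead of using Apriori pruning, and it does not reproduce how Weka or Rattle compute their output.

```python
from itertools import combinations

transactions = [
    {"D", "W", "J"},
    {"W", "M", "N"},
    {"W", "J", "M", "N"},
    {"M", "N"},
]
n = len(transactions)
min_support, min_confidence = 0.5, 0.6


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


items = sorted(set().union(*transactions))
rules = []
for size in range(2, len(items) + 1):
    for combo in combinations(items, size):
        whole = frozenset(combo)
        if count(whole) / n < min_support:
            continue  # not a frequent itemset, so it yields no rules
        # Split the frequent itemset into every antecedent/consequent pair.
        for r in range(1, size):
            for antecedent in combinations(combo, r):
                x = frozenset(antecedent)
                y = whole - x
                confidence = count(whole) / count(x)
                lift = confidence / (count(y) / n)
                if confidence >= min_confidence:
                    rules.append((x, y, confidence, lift))

single_consequent = [rule for rule in rules if len(rule[1]) == 1]
print(len(rules), len(single_consequent))  # 14 11
```

The three rules that drop out are those from {M, N, W} with a two-item consequent: {M} → {N, W}, {N} → {M, W} and {W} → {M, N}.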
