Mining Class-Dependent Rules Using the Concept of Generalization/Specialization Hierarchies

Juliano Brito da Justa Neves (1), Marina Teresa Pires Vieira
{juliano,marina}@dc.ufscar.br
Computer Science Department, Federal University of São Carlos (UFSCar), Caixa Postal 676, São Carlos, SP, Brazil

Abstract

Data mining is the process of discovering useful and previously unknown patterns in large datasets. Discovering association rules between items in large datasets is one such data mining task. A discovered rule is considered interesting if it brings new and useful information about the data. In this paper we show that if a dataset can be divided into subclasses and analyzed as if it were a generalization/specialization hierarchy, the class/subclass relationship can lead to the discovery of class-dependent rules, which reveal interesting differences between the behavior of the whole dataset and the behavior of each of its subclasses. These differences are extracted from the association rules mined from the whole dataset and from each subclass separately. Using the concept of generalization/specialization hierarchies, comparisons and tests of statistical hypotheses, we present methods that efficiently discover class-dependent rules and identify which classes have a higher influence on each rule. The methods also show when a given stimulus can provoke in one class a response that would not be expected if only the undivided dataset were analyzed.

Keywords: Data mining; Generalization/specialization hierarchies; Class-dependent rules; Interesting rules

1 Introduction

One of the key aspects of data mining is that the discovered knowledge should be interesting, the term "interesting" meaning that the discovered knowledge is useful and previously unknown. One of the most studied data mining tasks is association rule mining, first introduced in [1]. Association rules identify items that are most likely to occur together in a significant fraction of all transactions in a dataset.
Every rule must satisfy two specified constraints: support, which measures its statistical significance, and confidence, which measures its strength. There are also other interest measures besides support and confidence that are used to discover even more interesting association rules. The objective of this paper is not to propose new methods for mining interesting association rules. Rather, using the concept of class/subclass hierarchies (generalization/specialization hierarchies), comparisons and tests of statistical hypotheses, we present methods for discovering interesting knowledge that is not the focus of current data mining techniques. Focusing on differences that can yield interesting knowledge between association rules discovered from the levels of the class/subclass hierarchy, we show in this paper how to identify class-dependent rules, in order to state whether a given behavior of the entire population under study occurs in the same manner in all its subclasses and whether a given

(1) MPhil scholarship, CAPES/Brazil

stimulus can lead to different class-dependent responses. For instance, suppose a hospital database in which the rule aspirin → improved_health_status holds. It is interesting to know whether, for a given subclass of the whole dataset, such as pregnant women over 40 years old, the same rule holds with support and confidence values so different that it can be concluded that this subclass does not follow the behavior of the whole dataset with respect to this rule. It is also interesting to know whether, for a given subclass, there are rules with the same stimulus (aspirin) having different responses that were not expected for the whole dataset, such as aspirin → lower_heart_beat_rate, which may hold only for the pregnant women over 40 years old and not for the whole dataset, making explicit that this subclass has a class-dependent response to aspirin. In the next section we review the necessary background on the association rule problem and present some current work on interest measures. In Section 3 we present the methods and theory for mining class-dependent rules using the concept of generalization/specialization hierarchies and tests of statistical hypotheses. In Section 4 we present examples and experiments, and in Section 5 our conclusions and future work.

2 Background

The association rule problem can be formally stated as follows [1, 2]: Let I = {i1, i2, ..., im} be a set of distinct attributes, also called items. Let D be a dataset of transactions, where each transaction contains a set of items {ii, ij, ..., ik} ⊆ I. An association rule can be represented in the form X → Y, where X, Y ⊂ I and X ∩ Y = Ø. X is called the antecedent and Y the consequent of the rule. A set of items is called an itemset, and for each nonnegative integer k an itemset with k items is called a k-itemset. The support of a given itemset is the percentage of transactions in D in which the given itemset occurs as a subset of the transaction.
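The support measure just defined, together with the confidence measure discussed next (support(X ∪ Y)/support(X)), can be sketched as follows. This is a minimal illustration with hypothetical helper names, representing transactions as Python sets; it is not the paper's own implementation.

```python
def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    # support(X ∪ Y) / support(X): fraction of X-transactions also containing Y.
    return support(X | Y, transactions) / support(X, transactions)

# Toy dataset (illustrative only, not the paper's example)
D = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
print(support({"A", "B"}, D))        # 2 of 4 transactions -> 0.5
print(confidence({"A"}, {"B"}, D))   # 2 of the 3 A-transactions -> 0.666...
```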
The confidence of a rule X → Y is defined as the ratio support(X ∪ Y)/support(X), and represents the percentage of all transactions containing X that also contain Y. The task of mining association rules can be decomposed into two subtasks: 1. Generate all itemsets with support above a user-specified support threshold. These itemsets are called large itemsets; all others are called small itemsets. 2. For each large itemset, all rules with confidence above a user-specified confidence threshold are generated as follows: for a large itemset X and any Y ⊂ X, if support(X)/support(X − Y) ≥ minimum confidence, then the rule X − Y → Y is a valid rule. The support and confidence thresholds were the first interest measures used in the task of mining association rules. Current work deals with discovering association rules using other interest measures besides support and confidence. In [3], association rules that identify correlations are mined, and the presence and absence of items are used as a basis for generating rules; the chi-squared test from classical statistics is used as the measure of significance of associations. In [4], support is still used as an interest measure, but instead of confidence a metric called conviction is used, which is a measure of implication rather than mere co-occurrence. In [5], the interest measures entropy gain (mutual information), gini index (mean squared error) and chi-squared (correlation) are used to find association rules that segment large categorical datasets into two parts that are optimal according to some objective function. In [6], a support/confidence border is used to discover association rules that satisfy a number of interest measures, such as support, confidence, entropy, chi-squared and conviction. In [7], confidence-based interest measures named any-confidence, all-confidence and bond are used to discover interesting association rules, and [8] presents a survey of other interest measures. Most of the current work on interest measures aims at discovering interesting rules by analyzing the relationships between the items that compose them, which vary according to the rule being analyzed, and the whole dataset of transactions, which is the same for all rules. Our work presents a different approach, in which the same rule can be found in different datasets (class/subclass) of transactions, and the differences in the relationship between the rule and each dataset in which it occurs are used as a measure of interest.

3 Mining Class-Dependent Rules

Our approach can be described as follows. Let D be a dataset of transactions. According to some user-defined criteria, D can be divided into k subclasses Di, where D1 ∪ D2 ∪ ... ∪ Dk = D and D1 ∩ D2 ∩ ... ∩ Dk = Ø. Using the concept of generalization/specialization hierarchies, D is said to be specialized into k subclasses or, reciprocally, D1, D2, ..., Dk are said to be generalized into class D, as shown in Figure 1. This hierarchy is disjoint and total, so each transaction in D is found in exactly one subclass, and all association rules from each subclass are found in D, possibly with different support and confidence values. If D already presents a generalization/specialization hierarchy, the proposed approach naturally adapts itself to it.

Figure 1 - Generalization/Specialization Hierarchy

Let r be the association rule X → Y. Analyzing r with respect to D, r has support s_D, where s_D is the number of transactions in D in which r holds, and confidence c_D, where c_D is the number of transactions in D containing X and Y divided by the number of transactions in D containing X.
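The disjoint and total specialization of D into subclasses can be sketched as below. This is a hedged illustration with a hypothetical representation: D as a dict mapping transaction IDs to itemsets, and a user-supplied criterion function that assigns each transaction to exactly one subclass.

```python
def specialize(D, criterion):
    # Assign each transaction to exactly one subclass, so the resulting
    # hierarchy is disjoint (no overlap) and total (covers all of D).
    subclasses = {}
    for tid, items in D.items():
        subclasses.setdefault(criterion(tid, items), {})[tid] = items
    return subclasses

# Toy dataset and criterion (illustrative only)
D = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"a", "c"}, 4: {"c", "d"}}
parts = specialize(D, lambda tid, items: "D1" if "a" in items else "D2")
# parts["D1"] holds transactions 1 and 3; parts["D2"] holds 2 and 4
```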
Backed by the concept of generalization/specialization hierarchies, it can be stated that r occurs in the k subclasses of D with support s_Di, where s_Di is the number of transactions in Di in which r holds and s_D1 + s_D2 + ... + s_Dk = s_D, and with confidence c_Di, where c_Di is the number of transactions in Di containing X and Y divided by the number of transactions in Di containing X. Let n_D be the total number of transactions in D and n_Di the total number of transactions in each Di, so that n_D1 + n_D2 + ... + n_Dk = n_D. Next, the two kinds of class-dependent rules are defined:

Definition 1 (Same Stimulus, Same Response, Different Support and/or Confidence Values): The rule r is class-dependent and, therefore, interesting, if r holds for D and if the ratio s_D/n_D is significantly different from s_Di/n_Di and/or c_Di is significantly different from c_D, for a given subclass Di, according to tests of statistical hypotheses.

From statistics, if a population follows a given distribution, a random sample from it will follow the same distribution. For a given rule, if the tests of statistical hypotheses identify differences in the support and/or confidence values, the subclass does not follow the distribution of the class with respect to this rule. In this case the sample is not random, indicating that there is a dependency between the rule and a given class, so the rule is class-dependent.

Definition 2 (Same Stimulus, Different Responses): The rule X → Y is class-dependent and, therefore, interesting, if for a specified minimum support and confidence: i. X is frequent in D and in Di; ii. X → Y holds for Di but does not hold for D.

These rules state that for the same stimulus (the antecedent of the rule) there can be different responses (the consequent of the rule), and those responses are class-dependent, since they occur only in a specific subclass and not in the entire dataset. Under both definitions, each rule and stimulus must be tested against all subclasses Di, since it is interesting to know with respect to which subclass a rule is considered class-dependent. With more than two subclasses, it is possible for one subclass to follow the behavior of the whole dataset while the others do not, creating rules that are class-dependent with respect to some subclasses and not class-dependent with respect to others. The description of the tests of statistical hypotheses and the analysis of the same stimulus having different responses are presented in the following sections.

3.1 Problem 1: Same Stimulus, Same Response, Different Support and/or Confidence Values

The problem of mining class-dependent rules can be solved as follows. First, the dataset of transactions and its subclasses are analyzed assuming that they follow a binomial probabilistic model.
The binomial probabilistic model is the most frequently encountered among discrete probability functions, and it is associated with repeated trials of an experiment which has only two classes of outcomes, such as success or failure [9]. The dataset of transactions is the experiment, and each transaction is a trial of the experiment. In this case, the two possible outcomes for each trial are "the rule holds" and "the rule does not hold". Using any association rule mining algorithm, for instance Apriori [2], FP-growth [10] or ECLAT [11], mine all the association rules that satisfy the user-specified minimum support and confidence from the dataset of transactions. For each rule r mined, find the same rule in each subclass of D. Notice that the rule may have different support and confidence values in each subclass. With the number of transactions in D and in each subclass, and the support and confidence values for D and for each subclass, use the test of statistical hypotheses for the binomial probabilistic model to check for statistically significant differences in the support and confidence values. The algorithm for mining class-dependent rules is shown in Figure 2; the test of statistical hypotheses and the confidence coefficient α used in the algorithm will be explained next.

Algorithm: Class-dependent rules
Input: Dataset D, its subclasses Di, confidence coefficient α, support s and confidence c
Output: The list of class-dependent rules Lr
1. Using an association rule mining algorithm, mine all rules satisfying s and c from D
2. For each rule r mined
3.   For each subclass Di

4.     Mine rule r from Di
5.     Test rule r for support and confidence differences using α
6.     If r is different according to α, add r to Lr
7.   end
8. end
9. return Lr

Figure 2 - Algorithm for finding class-dependent rules

From statistics, in order to test two proportions p1 and p2, it is required to formulate two hypotheses, H0 (the null hypothesis) and H1 (the alternative hypothesis). The hypotheses for testing whether p1 and p2 are different are:

H0: p1 = p2
H1: p1 ≠ p2

In order to accept the alternative hypothesis it is necessary to show that the null hypothesis does not hold. For this purpose, the following expression must be evaluated:

z = (p1 - p2) / sqrt( (p2(1 - p2)/n) × ((N - n)/(N - 1)) )    (1)

where p1 is the proportion observed in the sample being analyzed, p2 is the same proportion observed in the population from which the sample was extracted, n is the size of the sample and N is the size of the population. If the value of z is less than -Zα/2 or greater than +Zα/2, where Zα/2 depends on the confidence coefficient α, we can reject the null hypothesis and accept the alternative one, as shown in Figure 3.

Figure 3 - Acceptance and rejection areas

It must be noticed that accepting a hypothesis does not mean that it is true; it just means that there is not enough evidence that it is false [9]. Table 1 shows the most frequently used confidence coefficients and their corresponding Zα/2 values according to [9]:

1 - α:   0.90   0.95   0.954   0.98   0.99   0.997
Zα/2:    1.64   1.96   2.00    2.33   2.58   3.00

Table 1 - Frequently used confidence coefficients and corresponding Zα/2 values [9]
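The test applied in step 5 of Figure 2 can be sketched as follows. This is a hedged implementation of expression (1), including its finite-population factor; the function names are ours, not the paper's.

```python
import math

def z_statistic(p1, p2, n, N):
    # Expression (1): p1 = proportion in the sample (subclass), p2 = proportion
    # in the population (class), n = sample size, N = population size.
    return (p1 - p2) / math.sqrt((p2 * (1 - p2) / n) * ((N - n) / (N - 1)))

def rejects_null(p1, p2, n, N, z_alpha_2=1.96):
    # Reject H0: p1 = p2 when z falls outside [-Z_alpha/2, +Z_alpha/2];
    # 1.96 corresponds to 1 - alpha = 0.95 in Table 1.
    return abs(z_statistic(p1, p2, n, N)) > z_alpha_2

# Rule 1 of the synthetic example in Section 4.1:
# support 10% in D1 (n = 10) versus 45% in D (N = 20).
print(round(z_statistic(0.10, 0.45, 10, 20), 2))  # -3.07, outside the acceptance area
```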

Our hypothesis is that the support and/or the confidence of a given rule in each subclass differ from the support and/or confidence of the same rule in the whole dataset, indicating that the rule's behavior depends on the subclass in which it occurs. In order to analyze the differences between support values, we replace p1 in (1) with the support value for Di and p2 with the support value for D, resulting in

z = (s_Di/n_Di - s_D/n_D) / sqrt( ((s_D/n_D)(1 - s_D/n_D)/n_Di) × ((n_D - n_Di)/(n_D - 1)) )    (2)

In order to analyze the differences between confidence values, we replace p1 in (1) with the confidence value for Di and p2 with the confidence value for D, resulting in

z = (c_Di - c_D) / sqrt( (c_D(1 - c_D)/n_Di) × ((n_D - n_Di)/(n_D - 1)) )    (3)

If a rule is class-independent, it will occur with the same support and confidence in each subclass, and the tests will accept the null hypothesis for both support and confidence.

3.2 Problem 2: Same Stimulus, Different Responses

The problem of mining association rules showing the same stimulus (the antecedent of the rule) having different responses (the consequent of the rule) depending on the subclass can be solved as follows. First, using an association rule mining algorithm, mine all the association rules that satisfy the user-specified minimum support and confidence from the dataset of transactions. For each different stimulus X (the antecedent of a rule in the form X → Y) mined from the class, search each subclass for rules X → W, where Y ≠ W, with the following constraints:

- For X → W, (number of transactions in Di in which the rule holds)/(number of transactions in Di) ≥ minimum support;
- For X → W, (number of transactions in D in which the rule holds)/(number of transactions in D) < minimum support;
- For X → W, confidence_Di ≥ minimum confidence, where confidence_Di is the confidence with respect to Di.

These constraints guarantee that the rule X → W is class-dependent, since it does not hold for D but does hold for Di.
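The constraints above can be sketched as a filter over the rules mined from D and from a subclass Di. This is a hedged sketch with a hypothetical representation: each rule set is a dict mapping (antecedent, consequent) pairs to (support, confidence) values produced by any association rule miner.

```python
def class_dependent_responses(rules_D, rules_Di, min_sup, min_conf):
    # Keep rules X -> W that meet both thresholds in the subclass Di
    # while their support in the whole dataset D stays below min_sup.
    found = []
    for (X, W), (sup_i, conf_i) in rules_Di.items():
        if sup_i >= min_sup and conf_i >= min_conf:
            sup_D, _ = rules_D.get((X, W), (0.0, 0.0))
            if sup_D < min_sup:
                found.append((X, W))
    return found

# Mirroring Section 4.1: E -> A holds in D1 (s=30%, c=100%) but not in D,
# while E -> F meets the thresholds in both and is therefore excluded.
rules_D = {(("E",), ("F",)): (0.45, 0.818), (("E",), ("B",)): (0.45, 0.818)}
rules_D1 = {(("E",), ("A",)): (0.30, 1.00), (("E",), ("F",)): (0.30, 1.00)}
print(class_dependent_responses(rules_D, rules_D1, 0.20, 0.80))  # [(('E',), ('A',))]
```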
If for X → W the ratio (number of transactions in Di in which the rule holds)/(number of transactions in Di) is less than the minimum support and/or its confidence is less than the minimum confidence, the rule X → W is not significant enough in this subclass and therefore is not considered. The algorithm for finding the same stimulus having different responses is shown in Figure 4.

Algorithm: Find the same stimulus having different responses
Input: Dataset D, its subclasses Di, support s and confidence c
Output: The list of stimuli and their class-dependent responses L(X,r)
1. Using an association rule mining algorithm, mine all rules satisfying s and c from D
2. For each subclass Di
3.   Mine all rules satisfying s and c from Di
4. end
5. For each stimulus X mined from D
6.   For each subclass Di

7.     Find all rules with X as a stimulus
8.     For each rule r found, check if this rule holds for D
9.     If r does not hold for D, add the pair [X, r] to L(X,r)
10.  end
11. end
12. Return L(X,r)

Figure 4 - Find the same stimulus having different responses

The returned rules are the class-dependent rules for a given stimulus found in D.

4 Examples and Experiments

This section presents an example with synthetic data, in order to show the usage of the methods presented, and an experiment with a publicly available dataset.

4.1 Analysis of Synthetic Data

Let Table 2 be the dataset of transactions D and Table 3 the rules mined with minimum support = 20% and minimum confidence = 80%, where s and c are respectively the support and the confidence of the rules.

Table 2 - Dataset of transactions D
ID  Items in transaction    ID  Items in transaction
1   B D E F                 11  A B C F
2   A B C F                 12  A E F G
3   A B D F                 13  B C E F
4   C D F G                 14  A D F G
5   A B E F                 15  C D F G
6   A E F G                 16  A B C G
7   B C D E                 17  B C E G
8   A B D G                 18  A B C G
9   B E F G                 19  B E F G
10  B E F G                 20  B E F G

Table 3 - Rules mined
1) E → B       s=45.0%  c=81.8%
2) E → F       s=45.0%  c=81.8%
3) C A → B     s=20.0%  c=100.0%
4) E G → F     s=30.0%  c=85.7%
5) E G B → F   s=20.0%  c=80.0%
6) G B F → E   s=20.0%  c=100.0%

Suppose that for a given user criterion, D is divided into two subclasses, D1 and D2. Table 4 shows subclass D1 and Table 5 shows subclass D2. Although they do in this example, the subclasses do not need to have the same number of transactions.

Table 4 - Subclass D1       Table 5 - Subclass D2
ID  Items in transaction    ID  Items in transaction
1   A B C F                 1   B C E F
2   A E F G                 2   B D E F
3   A B C F                 3   C D F G
4   A B D F                 4   B E F G
5   A D F G                 5   B E F G
6   A B E F                 6   C D F G
7   A E F G                 7   B C D E
8   A B C G                 8   B C E G
9   A B D G                 9   B E F G
10  A B C G                 10  B E F G

As an example, the analyses of rules 1 and 2 from Table 3 are presented next, with α = 0.05. Rule 1 occurs in subclass D1 with support = 10.0%. Using formula (2) to find differences between support values (Problem 1),

z = (0.10 - 0.45) / sqrt( (0.45 × (1 - 0.45)/10) × ((20 - 10)/(20 - 1)) ) = -3.07

which is less than -1.96. So the null hypothesis is rejected and we can say that rule 1 does not have the same support value in D and in D1, indicating that D1 has a behavior different from that of D with respect to rule 1. The same applies to D2, where the same rule appears with support = 80.0%, resulting in a z value of +3.07, which is greater than +1.96. If the support values were considered equal by the test of statistical hypotheses, the test regarding confidence would be used. Testing rule 1 for different responses to the same stimulus (Problem 2), we find in D1 the rule E → A with support = 30.0% and confidence = 100.0%. This rule does not hold for D, but does hold for D1. In D2 there are no rules with E as a stimulus with support and confidence above the specified thresholds except for E → F, but this rule does hold for D. This test is useful because, analyzing D, we expect the stimulus E to provoke the response B, and it does; but for D1 the stimulus E provokes another class-dependent response that would not be expected if only D had been analyzed. Note that we do not have to test rule 2 for different responses to the same stimulus, since its stimulus is equal to one already tested. Rule 2 appears in subclass D1 with support = 30.0% and confidence = 100.0%. Using formula (2) to find differences between support values (Problem 1),

z = (0.30 - 0.45) / sqrt( (0.45 × (1 - 0.45)/10) × ((20 - 10)/(20 - 1)) ) = -1.31

which allows the acceptance of the null hypothesis for support.
Using formula (3) to find differences between confidence values (Problem 1),

z = (1.00 - 0.818) / sqrt( (0.818 × (1 - 0.818)/10) × ((20 - 10)/(20 - 1)) ) = +2.06

which allows the rejection of the null hypothesis for confidence. In subclass D2 the same rule appears with support = 60.0% and confidence = 75.0%. Using formula (2) to find differences between support values (Problem 1),

z = (0.60 - 0.45) / sqrt( (0.45 × (1 - 0.45)/10) × ((20 - 10)/(20 - 1)) ) = +1.31

which allows the acceptance of the null hypothesis for support, and using formula (3) to find differences between confidence values (Problem 1),

z = (0.75 - 0.818) / sqrt( (0.818 × (1 - 0.818)/10) × ((20 - 10)/(20 - 1)) ) = -0.77

which allows the acceptance of the null hypothesis for confidence. These results for D1 and D2 state that rule 2 is class-dependent, regarding confidence, only for D1. The analyses of rules 1 and 2 can offer relevant information. Considering a real situation, suppose that D is a hospital dataset, each item is a symptom of a given disease, and the mined association rules show relationships among the symptoms. If the hospital staff wishes to reduce the occurrence of symptom B by treating symptom E, it would be interesting to know that the associated symptom B is class-dependent, and that subclass D2 will benefit more from the treatment than subclass D1, where symptom A would also be affected. It would also be interesting to know that if symptom E is treated, the occurrence of symptom F would be affected differently in each subclass. To the best of our knowledge, this information cannot be discovered by traditional association rule mining algorithms.

4.2 Analysis of a Publicly Available Dataset

Using the voting-records dataset from [12] as a real example, dividing its 435 transactions into two subclasses, republican (168 transactions) and democrat (267 transactions), using support = 20% and confidence = 80% and analyzing only foreign politics votes (2) (el-salvador-aid, aid-to-nicaraguan-contras, immigration, duty-free-exports and export-administration-act-south-africa), we discovered that all 25 association rules discovered from the whole dataset are class-dependent according to Definition 1, and only 1 association rule discovered from the whole dataset is class-dependent according to Definition 2.
For instance, the rule aid-to-nicaraguan-contras=yes → el-salvador-aid=no, with support = 46.9% and confidence = 84.3%, is analyzed next. This rule occurs in the republican subclass with support = 4.2% and confidence = 29.2%, and in the democrat subclass with support = 73.8% and confidence = 90.4%. Using formula (2) to find differences between support values (Problem 1):

z_republican = (0.042 - 0.469) / sqrt( (0.469 × (1 - 0.469)/168) × ((435 - 168)/(435 - 1)) ) = -14.14

z_democrat = (0.738 - 0.469) / sqrt( (0.469 × (1 - 0.469)/267) × ((435 - 267)/(435 - 1)) ) = +14.16

which allows the rejection of the null hypothesis in both cases. Empirical tests show that for most real datasets it is very difficult for the user to divide the dataset in a way that makes the rules end up being

(2) In order to reduce the number of discovered association rules from more than 16000 to only 25 and keep the dataset focused on one single subject (foreign politics).

considered class-independent, so it is interesting to rank the subclasses in order to know which subclass has the most different behavior with respect to a given rule. This ranking is done by analyzing the absolute value of z. In the example of the voting dataset, |z_republican| = 14.14 and |z_democrat| = 14.16, showing that the democrats have the most different behavior among the subclasses. This is particularly useful when there are more than two subclasses. Continuing the analysis of this rule: looking only at the entire dataset of foreign politics votes, we could end up assuming that almost half of the congressmen would support the rule independently of political affiliation, but it is clear that this rule is endorsed only by the democrats. Of course, this information could be retrieved from the entire dataset if the political affiliation were present in each transaction, which could result in a rule of the form aid-to-nicaraguan-contras=yes democrat=yes → el-salvador-aid=no; but the dataset can be divided according to any user criterion, not only the presence or absence of a given item. Finally, the stimulus does not provoke any other response in either subclass for support = 20% and confidence = 80% (Problem 2). Another rule is el-salvador-aid=yes immigration=yes → duty-free-exports=no, with support = 20.2% and confidence = 82.2%. This rule appears in the republican subclass with support = 42.9% and confidence = 84.7%, and in the democrat subclass with support = 6.0% and confidence = 72.7%. Using formula (2) to find differences between support values (Problem 1):

z_republican = (0.429 - 0.202) / sqrt( (0.202 × (1 - 0.202)/168) × ((435 - 168)/(435 - 1)) ) = +9.34

z_democrat = (0.060 - 0.202) / sqrt( (0.202 × (1 - 0.202)/267) × ((435 - 267)/(435 - 1)) ) = -9.29

which allows the rejection of the null hypotheses in both cases, with the republicans having the most different behavior.
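The ranking by |z| described above can be sketched as follows, with the figures taken from the voting-records example (the helper repeats expression (1); names are ours):

```python
import math

def z_statistic(p1, p2, n, N):
    # Expression (1) from Section 3.1, with finite-population correction.
    return (p1 - p2) / math.sqrt((p2 * (1 - p2) / n) * ((N - n) / (N - 1)))

# Rule: aid-to-nicaraguan-contras=yes -> el-salvador-aid=no.
# Per subclass: (support, subclass size); whole-dataset support 46.9%, N = 435.
subclasses = {"republican": (0.042, 168), "democrat": (0.738, 267)}
ranked = sorted(
    subclasses,
    key=lambda c: abs(z_statistic(subclasses[c][0], 0.469, subclasses[c][1], 435)),
    reverse=True,
)
print(ranked)  # ['democrat', 'republican']: |+14.16| > |-14.14|
```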
Testing for the same stimulus having different responses, we find in the republican subclass the rule el-salvador-aid=yes, immigration=yes → aid-to-Nicaraguan-contras=no, with support = 42.9% and confidence = 84.7%. This rule holds in the republican subclass but does not hold in the entire dataset, showing that the same stimulus provoked a different response depending on the republican subclass.

5 Conclusions

This paper introduced methods for mining association rules that make explicit the differences in behavior between the whole dataset and its subclasses. These methods are able to efficiently discover when a given association rule is class-dependent and which classes have the greatest influence on each rule. They also show when a given stimulus can provoke, in one class, a response that would not be expected if only the undivided dataset were analyzed. This analysis can also be extremely useful in cases where a generalization/specialization hierarchy already exists, in which case it can reveal the same stimulus having different responses whose items occur only in the subclasses. Furthermore, the presented methods can be incorporated into current association rule mining algorithms. To the best of our knowledge, no current data mining algorithms implement similar methods or are able to discover the same information.
