Basic Association Rules

Size: px

Start display at page:

Download "Basic Association Rules"

Brendan Day
6 years ago
Views:

1 Basic Association Rules Downloaded 11/21/17 to Redistribution subject to SIAM license or coyright; see htt:// Abstract Previous aroaches for mining association rules generate large sets of association rules. Such sets are difficult for users to understand and manage. Here, the concet of a restricted conditional robability distribution is used to exlain an association rule. Based on this concet, a new tye of association rules, called basic association rules, is defined. We roose the GenBR algorithm to generate the set of classes of basic association rules. Theoretical analysis shows that the search sace of the algorithm can be translated to an n- cube grah. The set of classes of basic association rules generated by GenBR is easy for users to understand and manage. Our exeriments on synthetic and real datasets show that GenBR is either faster than revious aroaches or generates fewer rules or both. Keywords: Association Rules, Probabilistic/Statistical Methods, Data Reduction/Prerocessing, Restricted Conditional Probability Distributions, Inference Systems, and Basic Association Rules. 1 Introduction Techniques for mining association rules [1,2] were originally devised for alication to market basket data, but they have also been alied in many other domains to erform tasks [21,23,26]. Market basket data describes the items urchased from retail stores groued into transactions. A transaction tyically consists of items bought together at the same oint of time, but it may consist of items bought by a customer over a eriod of time. An itemset is a set of items, and a frequent itemset X is an itemset whose frequency in transactions, also referred to as its suort, denoted as su(x), is greater than a user secified suort threshold, minsu. The main task of association rule discovery is to extract frequent itemsets from market basket data and to generate association rules from these frequent itemsets. An association rule r is an imlication of the form X Y, where X and Y are two disjoint itemsets. The suort of the rule is the suort of X Y, denoted as su(r), which is given by the observed robability P(X = 1,Y = 1). The confidence of the rule, denoted conf(r), is given by the conditional observed robability P(X =1,Y = 1) / P(X = 1), which is denoted as (xy) / (x) in this aer. If an association rule has suort at least as great as minsu and Guichong Li and Howard J. Hamilton Deartment of Comuter Science University of Regina Regina, SK, Canada, S4S 0A2 {liguicho, hamilton}@cs.uregina.ca confidence at least as great as the confidence threshold called minconf, it is referred to as a valid association rule. An association rule with confidence 100% is an exact association rule; all other association rules are aroximate association rules. The Ariori algorithm [2] was roosed to discover all frequent itemsets and to generate all valid association rules corresonding to these itemsets by a fast algorithm, called FastGenRules. Many algorithms have since been roosed that reduce the time and sace required to find the frequent itemsets [2,14]. After all frequent itemsets have been found, valid association rules are generated. A serious roblem in association rule discovery is that the set of association rules can grow to be unwieldy as the number of transactions increases, esecially if the suort and confidence thresholds are small. As the number of frequent itemsets increases, the number of rules resented to the user tyically increases roortionately. Many of these rules may be redundant. The definition of redundancy for association rules has varied in revious aroaches. Toivonen et al. roosed finding a structural rule cover, which describes the same database rows as the original set of association rules [28]. Therefore, those rules that are not in the cover are regarded as redundant. In [11,20,24,30], the definition of redundant rules is based on several inference rules or an inference system. Therefore, all association rules that can be derived from other rules by alying inference rules are regarded as redundant. We adot the latter tye of definition. To address the roblem of rule redundancy, four tyes of research on mining association rules have been erformed. First, rules have been extracted based on userdefined temlates or item constraints [3,27]. Secondly, researchers have develoed interestingness measures to select only interesting rules [16,19,18]. Thirdly, researchers have roosed inference rules or inference systems to rune redundant rules and thus resent smaller, and usually more understandable sets of association rules to the user [5,11,20,24,30]. Finally, new frameworks for mining association rule have been roosed that find association rules with different formats or roerties [7,8,9]. The main roblems with revious aroaches are that they still generate too many rules, and these rules may be

2 redundant. For examle, a valid association rule X Y that is generated by one these aroaches may in fact be derived from some simler rule X Y with the same confidence as X Y, where X X and Y Y. Inference rules roosed by these aroaches do not resemble Armstrong axioms on functional deendencies. As well, in some aroaches, inference rules cannot infer the confidence of rules without extra information. In our research, we are creating an inference system on association rules, consisting of a set of inference rules such as augmentation and transitivity, which resembles the Armstrong axioms on functional deendencies and which allows the inference of the confidences of rules. The remainder of this aer is organized as follows. In Section 2, we resent related work. In Section 3, we define the concet of a basic association rule, and roose a new algorithm called GenBR for generating the set BR of classes of basic association rules from a set of frequent itemsets. The comutational comlexity of GenBR is also described. A comarison of our aroach and other aroaches is resented in Section 4. Our exeriments comared the erformance of our algorithm with that of revious algorithms on synthetic datasets and real-life datasets, with resect to the number of rules and the elased running time. Conclusions and future work are described in Section 5. 2 Previous Work Previous research showed that relatively small sets of association rules can be resented to users instead of all valid association rules. As well, for some aroaches, inference rules were suggested that allowed additional association rules to be derived from such small sets of rules. In this section, we describe three aroaches. First, reresentative association rules (RR) are based on a cover oerator with which other non-reresentative association rules can be generated [20]. Suose we have an association rule X Y. A cover oerator C, denoted C(X Y), is given by C(X Y) = {X Z V Z, V Y Z V = V } The set of all reresentative association rules is a minimal set of rules that covers all association rules by means of the cover oerator. The FastGenReresentative algorithm was roosed to efficiently comute a RR [20]. Secondly, a kind of non-redundant association rules with minimal antecedents and maximal consequents, called minimal non-redundant association rules, has been identified as articularly useful and relevant [5]. An association rule r: X Y is a minimal non-redundant association rule iff there does not exist an association rule r : X Y with su(r) = su(r ), conf(r) = conf(r ), X X and Y Y. A small non-redundant generating set for all valid association rules is formed by combining a generic basis GB for exact association rules and an informative basis IB for aroximate association rules. RI is defined as a transitive reduction of the informative basis corresonding to IB. Given a closure oerator c of the Galois connection, a set FC of frequent closed itemsets, the set G of their generators, and a artial order (inclusion relation) on the set of itemsets, the definitions of GB, IB and RI are as follows. GB = {r: g (f \ g) f FC g G f g f} IB = {r: g (f \ g) f FC g G c(g) f} RI = {r: g (f \ g) f FC g G c(g) f / f c(g) f f} Bastide et al. have roven that GB and IB contain only minimal non-redundant association rules and all exact association rules and aroximate association rules can be derived from GB and IB, resectively [5]. The GB and RI algorithms were roosed to generate a generic basis and a transitive reduction of the informative basis, resectively. According to the definition of a minimal non-redundant association rule, the suort and confidence of any association rules inferred from the generating set are the same as the suort and confidence of the rules from which they were inferred. The authors claim that none of the Armstrong axioms hold in non-redundant association rules. A similar aroach has been roosed for discovering a small cover for association rules based on closed itemsets, which adats the Duquenne-Guigues basis for exact association rules and the Luxenburger results for aroximate association rules [24]. Thirdly, informative cover has been roosed together with a new inference rule [11]. Let r, r be two association rules, denoted X Y and X Y, such that X Y X Y. If su(x ) su(x), we say that r covers r, denoted r r. The goal is to find an informative cover that covers all other association rules. The CoverRules algorithm has been roosed to generate an informative cover for association rules [11]. The cover oerator in the informative cover aroach is similar to the cover oerator in the reresentative association rule aroach. The difference between them is that the cover oerator of the informative cover aroach does not require the antecedent of the resulting association rule to be included in the antecedent of the initial association rule. In addition, the inference rocedure is not urely syntactic [11], because it uses Table 2.1. A Binary Dataset. A B C D E

3 information about the suort of the antecedent of the resulting association rule. The inference rules in the two aroaches are sound, but neither aroach infers the confidence of an association rule. We found that the rules generated by these aroaches may be redundant. If X Y is generated by any of these aroaches, it may be ossible to derive it 3.1 Definitions Definition Given a dataset D with I as a set of items and T as a set of transactions, an association rule X Y over a relation R I T is said to be in canonical form if Y = 1. According to this definition, we only consider the case from simler valid association rules. of Y containing a single item. Before we introduce other Examle 2.1. Suose we have the dataset shown in new notions, let us discuss another concet related to Table 2.1, and that minsu is 3 and minconf is 6. The conditional robability. sets of rules generated by the revious aroaches are A conditional robability distribution (CPD) P(Y X) shown in Table 2.2. is defined as P(X, Y) / P(X), where X and Y are random Consider the rules in Table 2.2 from the ersective variables [10]. Y is conditionally indeendent of Z given of a user. For RR, CE AD can be derived from simler X, denoted as I(Y, Z X), if and only if P(Y X) = P(Y X, association rules CE A and CE D by right union as Z), where X, Y, and Z are three disjoint sets of random with Armstrong axioms, and variables. The statement I(X, Z X) is referred to as a conf(ce AD) = conditional indeendence statement (CIS) [10] conf(ce A) conf(ce D). For GB, AB D is Four roerties that are satisfied by any joint unnecessary, because B D is in GB. For C, B AD can be derived from B A and B D by right union, assuming B A and B D are valid association rules. Transitivity cannot be used between GB and RI. For examle, E D in GB and D A in RI cannot be used to infer a valid association rule by transitivity. These examles show that some association rules in these generating sets are not in the most desirable form. ڤ 3 Discovery of Basic Association Rules We roose a new aroach to solve the roblems mentioned in Section 2. Table 2.2. The Generated Association Rules Algorithm Set Rule #Rule FastGenRules AR 36 FastGenRR RR CE AD, C AD, D E, D C, D A, B AD, AC DE, C DE, AE CD, E CD, A CD GB GB and RI C D, E D, AB D, E CD, CoverRules RI C CE D, B D, AC D, A D CE AD, C AD, D E, D C, D A, B AD, AC DE, C DE, E CD, A CD E CD, D E, C AD, D C, B AD, CE AD, D A robability distribution (JPD) are symmetry, decomosition, weak union, and contraction [10]. For examle, for the decomosition roerty, if I(X, Y W Z), then I (X, Y Z) and I (X, W Z). We describe a CIS in another way in Lemma Lemma Given a subset X X, we say I(Y, X \ X X ), i.e., Y is conditionally indeendent of X \ X given X, if and only if P(Y X) = P(Y X ), where X and Y are two disjoint sets of variables. Proof: Suose Z = X \ X. Then X, Y, and Z are three disjoint sets of variables. Since X = X Z, P(Y X) = P(Y X, Z). Assuming P(Y X) = P(Y X ), we derive P(Y X, Z) = P(Y X ). Thus, we have I(Y, Z X ), i.e., I(Y, X \ X X ) according to the definition of CIS. For the converse, assuming I(Y, X \ X X ), then according to the definition of CIS, we have P(Y X, X \ X ) = P(Y X ), i.e., P(Y X) = P(Y X ). ڤ Furthermore, we have Lemma Lemma Given two disjoint sets of variables X and Y, and X X, if I(Y, X \ X X ), then Z X \ X, I(Y, Z X ), i.e., Z X \ X, P(Y X) = P(Y X ) = P(Y X, Z). Proof: If Z = X \ X, the roof follows immediately. The decomosition roerty of conditional indeendencies states that if I(X, Y W Z), then I (X, Y Z) and I (X, W Z). Because I(Y, X \ X X ), Z X \ X, we obtain I(Y, Z X ) and I(Y, {X \ X } \ Z X ). From I(Y, X \ X X ), we obtain P(Y X) = P(Y X, X \ X ) = P(Y X ), and from I(Y, Z X ), we obtain P(Y X ) = P(Y X, Z). Hence, P(Y X) = P(Y X ) = P(Y X, Z). ڤ To exlain this idea more fully, we first give a more general definition as follows. Definition Given two disjoint sets of variables X and Y, and a conditional robability distribution P(Y X), a restricted conditional robability distribution (RCPD),

4 denoted as Pˆ (Yˆ Xˆ ), is a subset of P(Y X) defined by secifying subsets Domain(Yˆ ) Domain(Y) and Domain( Xˆ ) Domain(X). For binary variables X {0,1} and Y {0,1}, with Xˆ {1} and Yˆ {1}, the confidence (y x) of the association rule r: X Y is a ositive conditional robability of the RCPD Pˆ (Yˆ Xˆ ). In the following discussion, Pˆ (Yˆ Xˆ ) is simly denoted as Pˆ (Y X). Given X X, if we have Pˆ (Y X) = Pˆ (Y X ), we cannot guarantee that Z X \ X, Pˆ (Y X) = Pˆ (Y X, Z). Hence, the decomosition roerty cannot be alied in a RCPD. Examle Suose that we have the transaction dataset in Table 2.1 and that minsu is 3 and minconf is 6. Consider two association rules ACD E and D E. Although conf(acd E) = conf(d E) = 2/3, conf(cde) = 3/4, i.e., if X = {A, C, D} and Y = {E}. If we choose X ={D}, then Z = X \ X = {A, C}, (y x) = (y x, z) = (y x ) = 2/3, i.e., Pˆ (Y X) = Pˆ (Y X ). Let Z = {C} X \ X, since (y x, z) = 3/4, then (y x) (y x, z) and (y x ) (y x, z), i.e., Pˆ (Y X) Pˆ (Y X, Z) and Pˆ (Y X ) Pˆ (Y X, Z). ڤ In the context of RCPDs, we hoe to find a minimal subset X of X with resect to Y such that Pˆ (Y X) = Pˆ (Y X ), and Z X \ X, Pˆ (Y X) = Pˆ (Y X, Z). Definition Let X and Y be two disjoint sets of variables. X is a minimal conditional subset (MCS) of X with resect to Y if X is a subset of X that satisfies the following conditions: (1) Pˆ (Y X) = Pˆ (Y X ). (2) Z X \ X, Pˆ (Y X) = Pˆ (Y X, Z). (3) / X X such that conditions (1) and (2) hold for X. If X is a minimal conditional subset of itself with resect to Y, then X is conditionally minimal with resect to Y. For examle, in Table 2.1, AC is conditionally minimal with resect to E. We also define the dual. Definition Suose X and Y are disjoint sets of variables. If X is a MCS of X with resect to Y, then the restricted conditional robability distribution Pˆ (Yˆ Xˆ ) is a minimal conditional robability distribution (MCPD) of the RCPD Pˆ (Yˆ Xˆ ) with resect to Yˆ. If X is conditionally minimal with resect to Y, then Pˆ (Yˆ Xˆ ) is a MCPD of itself. If Pˆ (Yˆ Xˆ ) is a MCPD of Pˆ (Yˆ Xˆ ), then a series of equations follow, i.e., Z X \ X, Pˆ (Yˆ Xˆ ) = Pˆ (Y ˆ Xˆ, Ẑ ). Because MCS and MCPD are duals of each other, we use whichever is convenient. In revious research, MCS and MCPD have not been defined or emhasized by researchers. In the context of mining association rules, we use a RCPD Pˆ (Y X) for inference on association rules. We do not consider how Pˆ (XY) and Pˆ (X Y) behave. Examle Suose we have the transaction dataset shown in Table 2.1. Let X = {A, C, D}, Y = {E}, X = {A, C}. Because the confidences of both X Y and X Y are 2/3, i.e., the ositive conditional robability (e acd) = (e ac), we see that Pˆ (Y X) = Pˆ (Y X ) and that Z X \ X = {D}, Pˆ (Y X) = Pˆ (Y X, Z). Because (e acd) (e a) = 2/4 and (e acd) (e c) = 3/4, there is no X X = {A, C} such that Pˆ (Y X) = Pˆ (Y X ). So X is a MCS of X with resect to Y. Pˆ (Y X ) is a MCPD of Pˆ (Y X) with resect to Y. Similarly, {A}, {C}, and {E} are three MCSs of {A, C, E} with resect to D. ڤ In the context of mining association rules, a CPD P(Y X) is always referred to as a RCPD Pˆ (Yˆ Xˆ ). We define a new notion of minimal association rules analogously to minimal functional deendencies [25]. Definition Suose we have a set I of items and a transaction dataset D. A canonical association rule X Y over D is a basic association rule if X is conditionally minimal with resect to Y, i.e., / X X such that (1) P(Y X) = P(Y X ) and (2) Z X \ X, P(Y X) = P(Y X, Z). For examle, given the transaction dataset shown in Table 2.1, AC E is a basic association rule, and AC is a MCS of ACD with resect to E while P(E AC) is a MCPD of P(E ACE) with resect to E. 3.2 Comuting MCPDs According to the definition of basic association rules, either a MCS X with resect to Y or a MCPD P(Y X) corresonds to a basic association rule X Y. The confidence of the rule is a ositive conditional robability of P(Y X). Therefore, the crucial task of finding basic association rules is the comutation of all MCPDs. Given a set L of frequent itemsets, X L, and minconf, our aroach for comuting MCPDs corresonding to X is divided into two stes. We first construct a set of RCPDs in canonical form from X, and then we comute their MCPDs, in which all ositive conditional robabilities are at least as great as minconf. Our aroach is similar to the aroach for discovering the minimal directed I-Ma of a joint robability distribution (JPD) [10]. Suose that a ermutation (ordering) Y = {Y 1,,Y n } of a set of variables X = {X 1,,X n }, and (x) is a JPD of X. This aroach comutes any minimal set of redecessors i with resect

5 to Y i, and i satisfies (y i b i ) = (y i π i ), where i B i = {Y 1,,Y i-1 }. Hence, a directed minimal I-ma of (x) is constructed by designating i as arents of Y i. The differences from comuting I-ma from a JPD is that we do not ermutate items of a frequent itemset, and the conditions (contexts) in the restricted conditional robabilities contain all items in the frequent itemset excet for a single test item. For examle, to find MCPDs corresonding to the frequent itemset X 1 X 2 X 3 X 4 X 5, first the following RCPDs are constructed: P(X 1 X 2 X 3 X 4 X 5 ), P(X 2 X 1 X 3 X 4 X 5 ), P(X 3 X 1 X 2 X 4 X 5 ), P(X 4 X 1 X 2 X 3 X 5 ), and P(X 5 X 1 X 2 X 3 X 4 ). For P(X 1 X 2 X 3 X 4 X 5 ), one can observe all MCSs of X 2 X 3 X 4 X 5 with resect to X 1, such as a minimal subset {X 2 X 3 X 4 X 5 }, and Z {X 2 X 3 X 4 X 5 } \, (x 1 x 2 x 3 x 4 x 5 ) = (x 1 π) = (x 1 π, z), where P(X 1 ) is a MCPD of P(X 1 X 2 X 3 X 4 X 5 ). If (x 1 x 2 x 3 x 4 x 5 ) is less than minconf, then we sto the comutation of MCPDs. Otherwise, is a MCS of X 2 X 3 X 4 X 5 with resect to X 1. The corresonding basic association rule is X 1, as exlained in Section 3.1. Similarly, we comute the MCPDs of P(X 2 X 1 X 3 X 4 X 5 ), P(X 3 X 1 X 2 X 4 X 5 ), P(X 4 X 1 X 2 X 3 X 5 ), and P(X 5 X 1 X 2 X 3 X 4 ). Finally, a set of MCPDs is obtained. Thus, for each frequent itemset, a set of basic association rules with resect to X can be found. Because RCPDs do not obey decomosition, to comute a MCPD P(Y X ) of P(Y X), we should examine all cases, i.e., Z X \ {Y X }, P(Y X ) = P(Y X, Z). This rocess goes from to to bottom, and is deicted as a semi-lattice [12] in Figure 3.1. The frequent itemset itself, such as ABCDE, is laced at the first level. At the second level, we construct a set of RCPDs, such as P(A BCDE), P(B ACDE), P(E ABCD), P(D ABCE), and P(E ABCD). All RCPDs at the third level are constructed by setting their contexts as maximal subsets of the contexts of the RCPDs at the second level. At the fourth level, all RCPDs are formed by setting their contexts as the intersections of the contexts of the RCPDs at the third level. The number k of itemsets being intersected at level d of the semi-lattice is related to the length l of the frequent itemset at the first level. We have the formula k = d 2, 4 d l. For examle, for frequent 4-itemsets, the number of itemsets being intersected at level 4 is 2. For frequent 5-itemsets, the number of itemsets being intersected at level 5 is 3. The deth of the semi-lattice equals the size of the corresonding frequent itemset. For examle, for the comutation of the MCPDs of P(E ABCD), we first examine P(E ABC), P(E ABD), P(E ACD), and P(E BCD). If P(E ABCD) = P(E ABC) and P(E ABC) is minimal, then P(E ABC) is already a MCPD of P(E ABCD), and we examine P(E ABD), P(E ACD), and P(E BCD). Similar cases arise ABCDE P(A BCDE) P(B ACDE) P(E ABCD) P(D ABCE) P(C ABDE) level 2... P(E ABC) P(E ABD) P(E ACD) P(E BCD)... P(E AB) P(E AC) P(E AD) P(E BC) P(E BD) P(E CD) P(E A) P(E B) P(E C) P(E D) level 1 level 3 level 4 level 5 Figure 3.1. A Possible Semi-lattice for the Frequent Itemset ABCDE. for P(E ABD), P(E ACD), and P(E BCD). If P(E AB) is a MCPD of P(E ABCD), we only require P(E ABCD) = P(E ABC) = P(E ABD). The context of P(E AB) is the intersection of the contexts of P(E ABC) and P(E ABD). If P(E A) is a MCPD of P(E ABCD), we require P(E AB) = P(E AC) = P(E AD) = P(E ABCD), where P(E AB), P(E AC), and P(E AD) have already been obtained. Therefore, according to Definition and 3.1.4, P(E A) is a MCPD of P(E ABCD). During extension of the semi-lattice, the RCPDs in new children (the nodes at the next level) are also called the candidate MCPDs (not unique). We reduce the number of candidate MCPDs of new children by intersecting itemsets in the contexts of their arents (nodes at the revious level), and checking whether the candidate MCPDs of the children are equal to those of their arents, and whether the ositive conditional robabilities of their MCPDs are at least as great as minconf. To find all basic association rules, we should check all frequent itemsets and comute the corresonding minimal conditional robabilities. Therefore, we define the following concets. Definition Suose we have a transaction dataset D, minsu, minconf, and a set L of frequent itemsets over D. A class of basic association rules for X L is a set of basic association rules r:x A, denoted as C r (X), such that (1) X A X (2) X is a MCS of X \ A with resect to A (3) conf(r) minconf Definition Suose we have a transaction database D, minsu, and minconf. A basic association rule system, denoted as BR(D, minsu, minconf), is defined as the set of k distinct classes of valid basic association rules, i.e., BR(D, minsu, minconf) = {C i i r C r is a class of basic

6 association rules, 1 i k such that for all j, 1 j k, j i i, C r C j i r, C r C j r }. BR(D, minsu, minconf) is also simly denoted as BR when D, minsu, and minconf are clear from context. Examle Given the dataset in Table 2.1, minsu = 3, and minconf = 6, in Figure 3.2, we show how to comute all MCPDs corresonding to the frequent itemset ACDE. Figure 3.2 shows the search sace used to find MCSs by comuting a set of MCPDs corresonding to the frequent itemset ACDE. This search sace consists of four semi-lattices, in which the single node at the to level corresonds to the frequent itemset, and other nodes corresond to its subsets, each of which includes a ositive conditional robability. Regardless of the data in the dataset, at the second level in the structure, we always have four ositive conditional robabilities for ACDE. Each of them corresonds to an item in ACDE, such as (a cde), etc. At the third level of the structure, the itemsets aearing in the contexts of ositive conditional robabilities are always maximal subsets of the itemsets aearing in the context of ositive conditional robabilities in their arents. For examle, along the branch containing (a cde), de, ce and cd are maximal subsets of cde. If the ositive conditional robability of a child is equal to the ositive conditional robability of its arent and both ositive conditional robabilities are at least as great as minconf, then in Figure 3.2, they are connected with a bold arrow; e.g., a bold arrow is shown from the node including (a cde) to the node including (a ce), because (a cde) = (a ce). If the ositive conditional robability of a arent is not equal to the ositive conditional robability of its child, but the ositive conditional robability of the child is at least as great as minconf, we connect the arent node and the child node with a narrow arrow; e.g., a narrow arrow is shown from the node including (a cde) to the node including (a cd). If the ositive conditional robability of a arent is not equal to the ositive conditional robability of its child or a ositive conditional robability is less than minconf, further comutation of minimal conditional ACDE (a cd e) (c ade) (d ace) (e acd) (a de) (a ce) (a cd ) (c de) (c ae) (c ad) robabilities along this ath is terminated; e.g., a dotted arrow is shown from the node including (a cde) to the node including (a de). Hence, P(A CE) is a MCPD of P(A CDE). Similarly, we comute all MCPDs of P(C ADE), P(D ACE) and P(E ACD). As a result, a set of MCPDs with resect to the frequent itemset ACDE is obtained, i.e., P(A CE), P(C AE), P(D E), P(D C), P(D A) and P(E AC), where P(A CE) is a MCPD of P(A CDE) with resect to A, and P(C AE) is a MCPD of P(C ADE) with resect to C, etc. ڤ From the MCPDs corresonding the frequent itemset X, we can readily obtain a class of basic association rules, C r (X). A class of basic association rules derived from one frequent itemset may be comletely included in another class of basic association rules derived from another frequent itemset. Hence, classes of basic association rules that are comletely contained in other classes of basic association rules have no more information than the classes containing them. They are called redundant classes and are discarded. Examle Given the transaction dataset in Table 2.1, minsu = 3, and minconf = 6, from Examle 3.2.1, we obtain C r (ACDE) = {CE A, AE C, E D, C D, A D, AC E}. Similarly, we also obtain another class of basic association rules corresonding to the frequent itemset ADE, C r (ADE) = {A D, E D}. Since C r (ADE) C r (ACDE), C r (ADE) is discarded. ڤ 3.3 The GenBR Algorithm We roose the GenBR algorithm for generating BR. Its goal is different from that of the second ste of the Ariori algorithm [1], which generates all association rules. Our aroach consists of two main stes. Given a set of frequent itemsets and minconf, GenBR generates all classes of basic association rules. Secondly, the algorithm generates BR by discarding all redundant classes. The GenBR algorithm, resented in Figure 3.3, generates BR from a set of frequent itemsets L. For each frequent itemset I in L, the GenBC algorithm is called to generate a class of basic association rules corresonding to I. All classes discovered by GenBC are collected into (d ce) (d ae) (d ac) (e cd ) (e ad) (e ac) (a e) (a d ) (a c) (c e) (c d) (c a) (d e) (d c) (d a ) (e d) (e c) (e a) Figure 3.2. The Comutation of MCPDs.

7 Algorithm GenBR(L) Algorithm MinimalSubsets(i, I ) Purose: generate BR from a set L of frequent itemsets Purose: comute a set S of minimal conditional subsets of I Inut: L, a set of frequent itemsets, where L k L is all itemsets with resect to i. in L containing k items. Inut: i, an item such that I i is a frequent itemset. Outut: BR, a set of classes of basic association rules. I, an itemset. BAR = Ø Outut: S, a set of minimal conditional subsets of foreach I L k, k 2 I with resect to i. BAR = BAR GenBC(I) S = Ø end = su(i {i}) / su(i ) BR = RemoveRedundantClass(BAR) if ( minconf) then return BR S = {I } end S = MaximalSubsets(I ) Figure 3.3. The GenBR Algorithm. k = 2 Algorithm GenBC(I) while (S Ø) Purose: generate the class of basic association rules corresonding to I. foreach s S Inut: I, a frequent itemset. Outut: R, a class of basic association rules corresonding to I. 1 = su(s {i}) / su(s) if ( 1 = ) then R = Ø S = S {s} foreach item i I else S = S \ {s} I = I \ {i} end S = MinimalSubsets(i, I ) foreach s S foreach s S DelSuerset(s, S) // remove all s S such that R = R {s i} end s s from S return R S = IntersectionSet(S, k) end k = k + 1 Figure 3.4. The GenBC Algorithm. end return S BAR. The RemoveRedundantClass algorithm removes end redundant classes from BAR to give BR. Figure 3.5. The MinimalSubsets Algorithm. The GenBC algorithm, resented in Figure 3.4, generates the class of basic association rules corresonding to the frequent itemset I. The main loo of the GenBC algorithm is reeated for each item i in the frequent itemset I. It calls the MinimalSubsets algorithm for comuting the MCSs of I with resect to i. From these MCSs, the algorithm forms a set of basic association rules corresonding to I. The MinimalSubsets algorithm, resented in Figure 3.5, comutes the set of MCSs of I with resect to i. First, the algorithm determines whether = (i I ) minconf. If so, the algorithm initializes the minimal conditional subset S = {I }. The MaximalSubsets function roduces a set S of all maximal subsets of I as a set of candidate MCSs of I with resect to i, e.g., S = MaximalSubsets(BDE) = {BD, BE, DE}. Because this function is straightforward, we omit it. In the while loo, the algorithm examines the validity of candidate MCSs in S. For s S, the algorithm comutes the conditional robability (i s), and then comares (i s) with. If (i s) =, then s is a valid candidate MCS of I with resect to i, and it is stored in S. If smaller valid candidate MCSs of I are found, then suersets of them are removed from S. The DelSuerset function (omitted) does this task. The IntersectionSet(S, k) function (omitted) generates all smaller candidate MCSs of I, which are the intersections of itemsets in S in terms of the deth k of loo. For examle, the intersection of the two itemsets AE and AC is equal to A, and it is regarded as a candidate MCS. Examle Given the dataset in Table 2.1, minsu = 3, and minconf = 6, we describe the rocess of generating BR using GenBR. In the while loo of GenBR, we assume I = {ACDE} is selected from L. The GenBC algorithm is called to comute the class of basic association rules corresonding to ACDE. The main loo of the GenBC algorithm is reeated for each item in a frequent itemset. Inside the main loo, the MinimalSubsets algorithm is first called to generate all MCSs of CDE with resect to A. I = CDE and i = A. Because the ositive conditional robability (a cde) minconf, the MinimalSubsets algorithm s comuting MCPDs of P(A CDE), i.e., a set S of MCSs of CDE with resect to A. Initially, S = {I }. MaximalSubsets roduces a set S of all maximal subsets of CDE as candidate MCSs of

8 CDE with resect to A, S = {DE, CE, CD}. In the while loo, itemsets in S are checked to see if they are candidate MCSs of CDE with resect to A. When CE is examined, (a ce) = (a cde), so we obtain S = {BDE, CE}. For DE and CD, because (a de) (a cde) and (a cd) (a cde), DE and CD are removed from S, and S = {CE}. After DelSuerset, S = {CE} and S = {CE}. Because S has only one itemset, S = IntersectionSet(S, k) = Ø. The while loo in the MinimalSubsets algorithm exits, and S = {CE} is returned to GenBC. In GenBC, the basic association rule {CE A} is laced in R. Similarly, from the comutation of P(C ADE), we have a new basic association rule AE C, and R = {CE A, AE C}. From the comutation of P(D ACE), we have new basic association rules E D, C D and A D, and they are added into R. From the comutation of P(E ACD), a new basic association rule AC E is found. Finally, the algorithm generates a class of basic association rules corresonding to ACDE, R = {CE A, AE C, E D, C D, A E, AC E}. The algorithm returns R to GenBR. At this oint, BR = {{DE A, A B, D B, E B, A E}}. GenBR continues until all frequent itemsets in L, with size of at least 2, have been rocessed. Consequently, all basic association rules and their classes are generated and resented, as shown in Table 3.1. ڤ We define the following terminology over BR: (1) C r denotes a class of basic association rules. C i is a set of items, which aear in C r. (2) The class in which a basic association rule X Y resides is denoted as C r (X Y). For examle, C r (C D) might be referred to as one of C r (CDE), C r (ACD), C r (ACDE) and C r (CD), as shown in Table 3.1. (3) C i (X Y) denotes a set of items, which aear in C r (X Y). e.g, E C i (C D) ={C, D, E}. All redundant classes in Table 3.1 are discovered and discarded by the RemoveRedundantClass algorithm. A Table 3.1. A Set of Classes of Basic Association Rules. No. Class Basic association rule Redundant 1 C r (AB) B A Yes 2 C r (BD) B D Yes 3 C r (DE) E D, D E 4 C r (AD) D A, A D Yes 5 C r (CE) E C, C E Yes 6 C r (AC) A C, C A Yes 7 C r (CD) D C, C D 8 C r (ABD) B D, B A, D A, A D 9 C r (ADE) E D, A D Yes 10 C r (ACE) CE A, AC E, AE C Yes 11 C r (CDE) E D, E C, C E, C D 12 C r (ACD) A D, A C, C D, C A 13 C r (ACDE) E D, CE A, AC E, A D, AE C, C D redundant class C r (I) contains no more information than the class C r (I ) containing C r (I). For examle, in Table 3.1, C r (AB) C r (ABD). If a C r (I 1 ) is redundant and C r (I 1 ) C r (I 2 ), then I 1 I 2. All frequent itemsets are arranged in a semi-lattice based on their inclusion relation in order to discover all redundant classes. For examle, given the dataset shown in Table 2.1 and minsu = 3, all frequent itemsets form the semi-lattice shown in Figure 3.6. To identify the redundant class C r (AB), RemoveRedundantClass comares C r (AB) with C r (ABD) because AB ABD. To check whether C r (DE) is redundant, the algorithm comares C r (DE) with C r (ADE) and C r (CDE). Because C r (DE) C r (ADE) and C r (DE) C r (CDE), the algorithm does not need to comare C r (DE) with C r (ACDE), and C r (DE) is non-redundant. That means that to check whether C r (I 1 ) is redundant, we only comare C r (I 1 ) with all C r (I 2 ), where I 1 and I 2 are frequent itemsets, I 1 I 2, and I 1 = I 2-1, as justified by Theorem Theorem Given three itemsets I 1, I 2, and I 3, and I 1 I 2 I 3, if C r (I 1 ) C r (I 2 ), then C r (I 1 ) C r (I 3 ). Proof: Suose the condition is satisfied. Hence, r: X A C r (I 1 ), and a MCP (a x), but r C r (I 2 ), i.e., Z I 2 \ {X, A}, (a x) (a x, z). Because I 2 \ {X, A} I 3 \ {X, A}, Z I 3 \ {X, A}, (a x) (a x, z), i.e., P(A X) is not a MCPD of P(A I 3 \ A). Hence, r C r (I 3 ). ڤ In [22], we roosed an inference system for basic association rules. It is called the C-inference system, because it ermits a rule s confidence to be inferred. We roved that the C-inference system holds on BR and that all association rules can be derived from BR by the alication of inference rules in the C-inference system. The C-inference system is summarized in Table 3.2, where and q signify the confidences of association rules. Rules that cannot be derived from other rules by the C- inference system are called non-redundant association rules. ABD ACE ACDE ACD ADE CDE AB AC AD AE BD CD CE DE A B C D E L Hashtable Figure 3.6. A Semi-Lattice for All Frequent Itemsets

9 Table 3.2. The C-inference System on Basic Association Rules Inference Rule Premise Conclusion Augmentation X Y C r, X C i \ {X, Y} XX Y Pseudo-Transitivity q q X Y C r, XW Z C r XW Z Contraction q q X Y AR, XY Z AR X YZ Additivity q q X Y C r, W Z C r XW YZ Right union q q X Y C r, X Z C r X YZ Left union X Z C r, Y Z C r XY Z 3.4 The Comutational Comlexity of GenBR Let us analyze the comutational comlexity of GenBR. From Figure 3.1, we observed that itemsets contained in the context of RCPDs may be used to construct a secific semi-lattice for comuting MCPDs. All nodes in the semi-lattice are subitemsets of an itemset. This construction corresonds to the roblem of generating subsets in mathematics, or creating a hyercube (or a k-cube), or a Q n grah in grah theory [15]. We analyze the comutational comlexity of our algorithm based on the binomial theorem [13] as follows. According to the binomial theorem, the number of subitemsets of a k-itemset (nodes of the semi-lattice), denoted NS, is given by: NS = C 1 k + +C k k = 2 k 1 (3.1) By counting edges from the to of the semi lattice for a k-itemset, as shown in Figure 3.7, to its bottom, the number of edges, denoted NE, of the semi-lattice is given by: k 1 k NE = C k + 2C 2 k 3 1 k + 3C k + +(k 1)C k k 1 k = C k + 2C 2 k 3 k + 3C k + +(k 1)C 1 k + kc 0 k 0 kc k k k k i ik! = ick k = k i!( k i)! i= 1 k i= 1 ( k 1)! = k k ( i 1)!( k i)! i=1 k k 1 k 1 i 1 2 = k C = k k (3.2) i= 1 From Equation (3.1), we obtain the number of association rules NR for a k-itemset generated by the FastGenRules algorithm [1] as follows: NR = 2 k 2 (3.3) According to Figure 3.1, to comute the MCSs of ABCD with resect to E, in the worst case, the rocess forms a semi-lattice of ABCD. One edge in the semilattice corresonds to one comarison. For the whole itemset ABCDE, there are five semi-lattices of this kind. Hence, from Equation (3.2), the comutational comlexity TC of GenBR for a k-itemset is determined by: TC = k(k 1)(2 k 2 1) k 2 2 k 2 = O(k 2 2 k 2 ) (3.4) Given that l is the length of the longest itemsets in the set of frequent itemsets, the ratio of the comlexity of GenBR to that of FastGenRules is: R = O(l 2 2 l 2 )/O(2 l+1 ) = O(l 2 / 2 3 ) = O(l 2 ) (3.5) Although the time comlexity of GenBR exonentially increases with l, we show that GenBR erforms very well over actual datasets in the following section. 4 Comarison and Exerimental Results The overall comarison between our aroach and four revious aroaches is shown in Table 4.1. Table 4.1. The Overall Evaluation Algorithm Features Fast- Rules Fast- RR GBRI Cover- Rules GenBR Canonical form Non-redundant Infer a rule s suort Infer a rule s confidence Armstrong axioms-like #Rules All Fewer Most Fewest Best Elased time Fastest Slower Slowest Medium Fast First, consider the form of rules generated. Only GenBR generates non-redundant rules in canonical form. We believe that this tye of rules is easy for users to understand. Rules generated by GenBR are nonredundant, i.e., these rules and their confidences cannot be derived from simler rules and their confidences by using the inference rules defined in any of the other aroaches. Secondly, consider whether an inference rule ermits the suort and confidence of rules to be inferred. None of the aroaches can infer a rule s suort, and only GBRI and GenBR can infer confidences. Thirdly, consider whether the inference system resembles Armstrong s axioms; only the C-inference system does so. Fourthly, consider the number of rules generated. CoverRules generates the fewest, but GenBR organizes the basic association rules into classes related to the frequent itemsets. GenBR also tends to generate the fewest rules when minconf is high. Finally, with resect to the elased time, FastGenRules is the fastest on all datasets. GenBR is more efficient than the other aroaches, excet for a few cases. As described in the remainder of this section, exeriments on several synthetic and real-life datasets were conducted to comare the erformance of GenBR and revious algorithms with resect to the number of rules and the elased running time.

10 4.1 Exerimental Design The exerimental environment was a PC with a 2.53 GHz Intel CPU, 512MB of RAM, and Microsoft Windows XP. All algorithms were imlemented in Microsoft Visual Java++. The datasets used are listed in Table 4.2. Synthetic datasets, T10I4D100K and T20I6D100K, were generated by running GenData [17], which is a synthetic data generator from IBM. Several well-known benchmark real-world datasets were chosen from [29]. CustInfo-5 was obtained from the CustInfo database [4] by joining five tables. The number of attributes shown for the realworld datasets reresents the number after multi-valued attributes were converted to several binary attributes. Table 4.2. Synthetic and Real-world Datasets Dataset # Attributes #Items Length of Record or Transactions # Transactions Chess Connect Mushroom Molecular CustInfo T10I4D100K T20I6D100K To comare the erformance of GenBR and revious aroaches with resect to the number of rules and the elased running time, we imlemented the FastGenRules for generating association rules [2] as well as the GB algorithm for generating generic basis for exact rules, and the RI algorithm for generating informative basis for aroximate association rules [5]. The GB and the RI algorithms were combined into the GBRI algorithm for our exeriments. The FastGenReresentatives (denoted FastGenRR) algorithm generates a set RR of reresentative association rules [20]. CoverRules generates an informative cover of cover rules [11]. The GenBR algorithm consists of two stes to generate basic association rules. The first ste generates all classes of basic association rules. The second ste removes redundant classes. The GenBR algorithm is more similar to the FastGenRules algorithm than to the other algorithms. To comare the GenBR with FastGenRules with resect to the elased running time, we only consider the first ste of GenBR, which generates the basic association rules, and we ignore the elased time required for the second ste. 4.2 Results We first comared the algorithms with regard to the number of rules generated. GenBR generates fewer rules than FastGenRules, and in some cases fewer rules than FastGenRR, GBRI, and CoverRules. Table 4.3. Chess, minsu = 80 % Algorithm / minconf Fast- #Rules Elased time Rules (ms) #Rules GBRI Elased time Fast- #Rules RR Elased time Cover- #Rules Rules Elased time GenBR #Classes #Rules Elased time For Chess with minsu = 80%, the number of rules generated by the algorithms and their elased times are shown in Table 4.3. GenBR generates significantly fewer rules than FastGenRules for all settings of minconf. For examle, when minconf = 10%, FastGenRules generates rules, while GenBR generates rules. Among all aroaches, CoverRules generates the fewest rules when minconf is 90% or less. For examle, in Table 4.3, with minconf = 10%, CoverRules generates 226 rules while GenBR generates rules. However, when minconf is 100%, GenBR generates the fewest rules. For examle, in Table 4.3, GenBR generates only 6 rules, while the other aroaches generate 2228 or more rules. Table 4.4. Comarison wrt the Number of Rules. Dataset Minsu Fast- Rules Fast- RR GBRI Cover- Rules GenBR Chess 80% Chess 90% Connect-4 95% Connect-4 97% Mushroom 30% Mushroom 40% Molecular 5% Molecular 10% CustInfo-5 5% CustInfo-5 10% T10I4D100K 05% T10I4D100K 1% T20I6D100K 2% T20I6D100K 3% For other datasets and different arameters, the exerimental results concerning the number of rules generated are similar to those shown in Table 4.3. We summarize these exerimental results in Table 4.4. Because users are generally concerned with association rules with high confidences, we only describe the number of rules with 100% confidence (minconf = 100%) for all datasets besides CustInfo-5. For CustInfo-5, since there is no exact association rule when minsu = 5% or 10%, we use minconf = 90%. Under these conditions, GenBR

11 generates fewer rules than the other algorithms, excet for a few cases when CoverRules generates the fewest. Although the comutation comlexity of GenBR is exonentially greater than that of FastGenRules (as mentioned in Section 3.4), in our exeriments, we observed that the elased time for GenBR is significantly less than that for FastGenRules, excet for cases where both algorithms have their fastest erformance, which corresond to values of minconf aroaching 100%. For examle, in Table 4.3, when minconf is 10%, the elased time for GenBR is ms while that of FastGenRules is ms. When minconf is 100%, the elased time of GenBR is ms while the elased time of FastGenRules is 969 ms. Across all tested settings of minconf from 10% to 100%, GenBR has the lowest maximum elased time (25593 ms). The number of exact association rules greatly affects the elased time of GenBR. If there are more exact association rules, there may be more functional deendencies, and consequently, GenBR sends more time. We summarize the exerimental results related to elased time by recording the ratios of the elased time of GenBR to revious algorithms in Table 4.5. We set minconf = 100%. The bold entries identify cases where the corresonding algorithm is faster than GenBR. The results show that GenBR is faster than all algorithms excet FastGenRules on most resented datasets. Table 4.5. Evaluation wrt Elased Time Dataset Minsu Fast Fast Cover- Rules RR GBRI Rules Chess 80% Chess 90% Connect-4 95% Connect-4 97% Mushroom 30% Mushroom 40% Molecular 5% Molecular 10% CustInfo-5 5% CustInfo-5 10% T10I4D100K 05% T10I4D100K 1% T20I6D100K 2% T20I6D100K 3% Conclusions and Future Work 5.1 Conclusions In this aer, we roosed a new tye of association rules, called basic association rules. First, by referring the relational database theory on functional deendencies, we develoed the new concets of a restricted conditional robability distribution (RCPD), a minimal conditional robability distribution (MCPD), a minimal conditional subset (MCS), and a basic association rule. We established the C-inference system on basic association rules, which is similar to Armstrong s axioms on functional deendencies. Secondly, we roosed the GenBR algorithm for generating a set of classes of basic association rules from a set of frequent itemsets. GenBR efficiently generates basic association rules as comared with the revious aroaches for generating small sets of association rules. GenBR also generates fewer rules than revious aroaches when minconf is high. Thirdly, arguably, users will find basic association rules to be more manageable and understandable than reviously roosed reduced sets of association rules. The rules are concise, the number of rules is small, redundancy among rules in a class of basic association rules has been eliminated, and inference involving confidence values is ossible on the rules. This oint is argued at greater length in [22]. Fourthly, we showed that the search sace of our algorithm to comute basic association rules is a hyercube (n-cube) or Q n grah. This insight aided in our theoretical analysis of the algorithm. 5.2 Future Work An oen roblem is to find a more efficient algorithm for discovering basic association rules from frequent itemsets. Finding a heuristic method to discover basic association rules without generating frequent itemsets will also be challenging. We roosed the idea of a restricted conditional robability distribution as a foundation for mining association rules, and we distinguished it from the traditional conditional robability distribution. It can be regarded as a more general concet than the contextsecific conditional robability distribution described in [6]. Further work on restricted conditional robability distributions may be romising. We established the C-inference system on association rules, which may be regarded as an extension of Armstrong s axioms on functional deendencies. This kind of inference system works on exact and aroximate rules simultaneously. Further exloration of the roerties of the C-inference system is needed. References [1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. SIGMOD 93, ages , Washington. DC, May [2] R. Agrawal and R. Srikant. Fast algorithm for mining association rules. In Proc. VLDB 94, Santiago, Chile, [3] E. Baralis and G. Psaila. Designing temlates for mining association rules. Journal of Intelligent Information Systems, 9(1):7-32, July [4] B. Barber and H. J. Hamilton. Extracting share frequent itemsets with infrequent subsets. Data

12 Mining and Knowledge Discovery, 7(2): , Aril [5] Y. Bastide, N. Pasquier, R. Taouil, G. Stumme, and L. Lakhal. Mining minimal non-redundant association rules using frequent closed itemsets. In Proc. of the First International Conference on Comutational Logic, ages , 200 [6] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-secific indeendence in Bayesian networks. In Proc. 12 th Conf. on Uncertainty in Artificial Intelligence (UAI-96), ages , Portland, Oregon, [7] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlation. In Proc. SIGMOD 97, ages , May [8] S. Brin, R. Motwani, J. D. Ullman and S. Tsur. Dynamic itemset counting and imlication rules for market basket data. In Proc. SIGMOD 97, ages , Montreal, Canada, June [9] C. L. Carter, H. J. Hamilton, and N. Cercone. Share based measures for itemsets. In Proc. PKDD 97, ages 14-24, Trondheim, Norway, June [10] E. Castillo, J. M. Gutierrez, and A. S. Hadi, Exert Systems and Probabilistic Network Models, Sringer, [11] L. Cristofor and D. Simovici. Generating an informative cover for association rules. In Proc. of the IEEE International Conference on Data Mining, [12] B. A. Davey and H. A. Priestley. Introduction to Lattices and Order, Fourth edition. Cambridge University Press, [13] E. G. Goodaire and M. M. Parmenter. Discrete Mathematics with Grah Theory, Second Edition. Prentice Hall, Uer Saddle River, NJ, [14] J. Han, J. Pei, and Y. Yin. Mining frequent atterns without candidate generation. In Proc. SIGMOD 00, ages 1-12, Dallas, TX, May 200 [15] T. W. Haynes, S. T. Hedetniemi, and P. J. Slater. Fundamentals of Domination in Grahs. Marcel Dekker, [16] R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and Interest Measures, Kluwer Academic, Boston, [17] IBM. GenData, [18] M. Kamber and R. Shinghal. Evaluating the interestingness of characteristic rules. In Proc. KDD 96, ages , Portland, Oregon, [19] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivnen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. In Proc. of the 3 rd CIKM Conference, ages , November [20] M. Kryszkiewicz. Reresentative association rules and minimum condition maximum consequence association rules. In Proc. PKDD 98. ages , Nantes, France, [21] W. Lee, S. J. Stolfo, and K. W. Mokl. A data mining framework for building intrusion detection models. In Proc. IEEE Symosium on Security and Privacy, [22] G. Li. Basic Association Rules, M.Sc. Thesis, Deartment of Comuter Science, University of Regina, [23] W. Lin, S. A. Alvarez, and C. Ruiz. Collaborative recommendation via adative association rules mining. In Proc. WEBKDD 2000, San Francisco, CA, Aug. 200 [24] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Closed set based discovery of small covers for association rules. In Proc. of the 15 th Conference on Advanced Databases, ages , [25] C. M. Ricardo. Database Systems: Princiles, Design, & Imlementation, Macmillan, 199 [26] K. Satou, G. Shibayama, T. Ono, Y. Yamamura, E. Furuichi, S. Kuhara, and T. Takagi. Finding association rules on heterogeneous genome data. In Proc. of the Pacific Symosium on Biocomuting 97 (PSB 97), ages , Hawaii, Jan [27] R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. In Proc. KDD 97, ages [28] H. Toivonen, M. Klemettinen, P. Ronkainen, K. Hatonen, and H. Mannila. Pruning and grouing discovered association rules. In Proc. ECML-95 Worksho on Statistics, Machine Learning, and Knowledge Discovery in Database, ages 47-52, Aril [29] UCI Machine Learning Reository. ft://ft.ics.uci.edu/ub/machine-learningdatabases/ [30] M. Zaki, Generating non-redundant association rules. In Proc. KDD 2000, ages August 200

Efficient discovery of statistically significant association rules

Efficient discovery of statistically significant association rules Wilhelmiina Hämäläinen Deartment of Comuter Science University of Helsinki Finland whamalai@cs.helsinki.fi ABSTRACT In this aer, we introduce