Efficient search of association rules with Fisher's exact test

W. Hämäläinen
Department of Computer Science, University of Helsinki

Abstract

Elimination of spurious and redundant rules is an important problem in association rule discovery. Statistical measure functions like the χ²-measure, the z-score, and p-values offer a sound solution, but all of them are non-monotonic and computationally problematic. In addition, approximate measures like the χ²-measure do not work well when relative frequencies are low. On the other hand, exact probabilities are hard to calculate for any larger data set. We develop an efficient solution to this tricky problem. Non-monotonicity is handled with a new algorithm, which also prunes redundant rules on-line. The quality of rules is estimated by Fisher's exact test. The initial experiments suggest that the method is both feasible and improves the quality of association rules remarkably.

1 Introduction

The majority of data mining research has concentrated on scalable search methods. The main goal has been to handle ever larger data sets, while the quality of the discovered patterns has often been forgotten. This is especially true of traditional association rule discovery. In the worst case all discovered rules can be spurious, while the actually significant rules are missing [12, 11]. The problem is mainly computational, although there is debate on how the significance of association rules should be estimated and whether it is even possible. In any case, statistical measure functions like χ², the z-score, and p-values can be used to select better rules. The problem is how to do this efficiently.

A straightforward way is to first search frequent item sets in the traditional manner and use the measure functions afterwards to select the best rules. The problem is that significance does not coincide with any absolute minimum frequency threshold. If the threshold is set too high, many significant rules are missed. On the other hand, too low a threshold produces a huge number of spurious and redundant rules, if the computation finishes at all. At the same time, the actual set of significant, non-redundant rules is relatively small. There is no need to traverse large search spaces, if one just knows where to search.

A natural alternative is to search the rules directly with the desired measure function. Unfortunately, none of these statistical measure functions is monotonic, i.e. a set of attributes can produce significant rules even if its subset ("parent set") cannot. However, a property does not have to be monotonic in all respects to enable efficient search. In previous research [5], we have developed a new strategy for searching non-redundant significant association rules with a generic significance measure. The only requirement is that the measure function has a lower bound (an upper bound) which behaves monotonically, i.e. either decreases or increases with the frequency and lift.

Now we extend the algorithm to both positive and negative rules of the form X → A, X → ¬A, ¬X → A, and ¬X → ¬A. In practice, it is enough to search only positive correlations, but allow negation in the consequent: if rule X → A expresses positive correlation, then X → ¬A expresses negative correlation. Fisher's exact test is a natural choice for the measure function, but it is also tricky to handle. Calculation is time consuming and can easily cause an overflow when any larger data set is processed. As a solution, we derive good approximations for the upper and lower bounds of the p-value.

The third new contribution is an efficient strategy for frequency counting. The technique is not bound to this special problem, but can be used in other algorithms as well.

The idea of searching both positive and negative rules is not new, but only a few algorithms have been presented. In the approach by Wu et al. [13], negative rules were searched from a set of infrequent item sets. For pruning, they used a heuristic interestingness measure, in addition to minimum frequency thresholds. Antonie and Zaïane [2] also used minimum frequency thresholds, but the strength of correlations was estimated by the Pearson correlation coefficient. According to the authors' analysis on one data set, the method produced interesting rules. The research by Koh and Pears [6, 7] is the most closely related to ours. They have also rejected absolute frequency thresholds and used Fisher's exact test as a significance criterion. Their approach is a quite straightforward application of the standard Apriori and no special search strategies are used. The quality of the discovered rules was not evaluated. Webb [12, 11] has also used Fisher's exact test to estimate the significance of association rules, but no new algorithm was introduced.

According to empirical results, none of these methods is really scalable for large-scale data mining purposes. This is quite understandable, since the problem is demanding. In this sense, our algorithm can be considered really advantageous. An extra benefit is that the algorithm is general and suits several measure functions.

The rest of the paper is organized as follows: the basic definitions are given in Section 2. In Section 3, we consider statistical significance measures and especially Fisher's exact test. The search algorithm and pruning strategies are introduced in Section 4. Experimental results are reported in Section 5 and the final conclusions are drawn in Section 6. The basic notations are introduced in Table 1.

2 Basic Definitions

In the following we give the basic definitions of the association rule, statistical dependence, and redundancy.

2.1 Association Rules

Traditionally, association rules are defined in the frequency-confidence framework as follows:

Definition 1. (Association rule) Let R be a set of binary attributes and r a relation according to R. Let X ⊆ R, A ∈ R \ X, x ∈ Dom(X), and a ∈ Dom(A). The confidence of rule (X = x) → (A = a) is cf(X = x → A = a) = P(X = x, A = a)/P(X = x) = P(A = a | X = x) and the frequency of the rule is fr(X = x → A = a) = P(X = x, A = a). Given user-defined thresholds min_cf, min_fr ∈ [0, 1], rule (X = x) → (A = a) is an association rule in r, if (i) cf(X = x → A = a) ≥ min_cf, and (ii) fr(X = x → A = a) ≥ min_fr.

The first condition requires that an association rule should be strong enough and the second condition requires that it should be common enough. Originally, it was assumed that the frequency corresponds to the statistical significance of the rule [1]. In this paper, we call rules association rules even if no thresholds min_fr and min_cf are specified. Usually, it is assumed that the rule contains only positive attribute values (A_i = 1). Then the rule can be expressed simply by listing the attributes, e.g. A1, A3, A5 → A2. Rules of the form ¬X → A, X → ¬A, or ¬X → ¬A are often called negative association rules, even if the negations can occur only in front of the whole antecedent or consequent.

2.2 Statistical Dependence

Statistical dependence ("correlation") is classically defined through statistical independence (e.g. [8]):

Definition 2. Let X ⊆ R and A ∈ R \ X be sets of binary attributes. Events X = x and A = a, x ∈ Dom(X), a ∈ Dom(A), are mutually independent, if P(X = x, A = a) = P(X = x)P(A = a). If the events are not independent, they are dependent.

The strength of the statistical dependence between (X = x) and (A = a) can be measured by the lift or interest:

(2.1)  γ(X = x, A = a) = P(X = x, A = a) / (P(X = x)P(A = a)).
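
To make these quantities concrete, the following small Python sketch (not part of the original paper; the function name and toy data are illustrative only) computes the frequency, confidence, and lift of a rule X → A directly from a list of transactions.

```python
def rule_stats(rows, X, A):
    """Frequency fr = P(X,A), confidence cf = P(A|X), and lift of rule X -> A.

    rows: list of sets of attribute names (one set per row of the relation)
    X:    set of antecedent attributes, A: consequent attribute
    """
    n = len(rows)
    m_X = sum(1 for t in rows if X <= t)               # m(X)
    m_A = sum(1 for t in rows if A in t)               # m(A)
    m_XA = sum(1 for t in rows if X <= t and A in t)   # m(X,A)
    fr = m_XA / n
    cf = m_XA / m_X if m_X else 0.0
    lift = m_XA * n / (m_X * m_A) if m_X and m_A else 0.0
    return fr, cf, lift

# toy relation over attributes A1..A3; rule {A1, A3} -> A2
rows = [{"A1", "A2", "A3"}, {"A1", "A3"}, {"A2"}, {"A1", "A2", "A3"}]
print(rule_stats(rows, {"A1", "A3"}, "A2"))   # (0.5, 0.666..., 0.888...)
```
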
2.3 Redundancy

A common goal in association rule discovery is to find the most general rules (containing the minimal number of attributes) which satisfy the given search criteria. There is no sense in outputting complex rules X → A, if their generalizations Z → A, Z ⊊ X, are at least equally interesting. Generally, the goal is to find minimal (or most general) interesting rules, and to prune out redundant rules [3]. In our case, the interestingness measure is the statistical significance, but in general, redundancy and minimality can be defined with respect to any other measure function.

Generally, redundancy can be defined as follows:

Definition 3. Given some interestingness measure M, rule X → A is redundant, if there exists a rule X' → A' such that X' ∪ {A'} ⊊ X ∪ {A} and M(X' → A') ≥ M(X → A). If the rule is not redundant, then it is called non-redundant.

I.e. a rule is non-redundant, if all of its generalizations ("parent rules") are less significant. It is still possible that some or all of its specializations ("children rules") are better. In the latter case, the rule is unlikely to be interesting itself.

For negative association rules we extend the previous definition:

Definition 4. Let M be as before. Rule X → A is redundant, if there exists a rule X' → A' such that X' ∪ {A'} ⊊ X ∪ {A}, X' ∪ {¬A'} ⊊ X ∪ {A}, X' ∪ {A'} ⊊ X ∪ {¬A}, or X' ∪ {¬A'} ⊊ X ∪ {¬A}, and M(X' → A') ≥ M(X → A).

Non-redundant rules can be further classified as minimal or non-minimal:

Definition 5. (Minimal rule) Non-redundant rule X → A is minimal, if for all rules X' → A' such that X ∪ {A} ⊊ X' ∪ {A'}, M(X → A) ≥ M(X' → A').

I.e. a minimal rule is more significant than any of its parent or children rules. At the algorithmic level this means that we stop the search without checking any children rules, as soon as we have ensured that a rule is minimal. Once again, we can allow negation in front of the consequent, i.e. A or ¬A.
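
As a small illustration of Definition 3 for positive rules only (this sketch is not from the paper, and the rule encoding is invented), a rule can be tested for redundancy by comparing it against every more general rule whose measure value is already known.

```python
from itertools import combinations

def is_redundant(rule, M):
    """Rule (X, A) is redundant if some rule (X', A') with X' U {A'} a proper
    subset of X U {A} scores at least as high under the measure M."""
    X, A = rule
    items = X | {A}
    for r in range(1, len(items)):                 # proper subsets only
        for sub in map(set, combinations(items, r)):
            for cons in sub:
                parent = (frozenset(sub - {cons}), cons)
                if parent in M and M[parent] >= M[rule]:
                    return True
    return False

# measure values (e.g. -ln(p)) for two rules: B -> A and BC -> A
M = {(frozenset({"B"}), "A"): 5.0, (frozenset({"B", "C"}), "A"): 4.2}
print(is_redundant((frozenset({"B", "C"}), "A"), M))   # True: B -> A is at least as good
```
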

3 Measuring the statistical significance

The basic idea of statistical significance testing is to estimate the probability of the observed or a more extreme event under some null hypothesis. In the case of association rules, we have observed some frequency m(X, A) and lift γ(X, A) (or, equivalently, the confidence, when m(A) is known). The null hypothesis is that rule X → A is spurious, i.e. that X and A are actually independent. If we want to measure the significance of positive correlation, we calculate the probability of the observed or a larger m(X, A) under the independence hypothesis. If this probability is very small, we can assume that the observation is not just due to chance, but that X → A represents a genuine pattern.

So far everything should be clear. However, there are at least three ways to estimate the probability of the observed or a rarer event under the independence assumption. All of them can be correct, depending on the purpose and situation.

3.1 Binomial probability, version 1

The simplest approach is to assume that the actual P(X) and P(A) are the same as observed, and to estimate the probability that X and A occur together at least m(XA) times in the given data set (a sample of size n), assuming that they are actually independent. This means that m(XA) can be at most min{m(X), m(A)}. Now it is enough to consider the rows where X is true. On each such row, A occurs with probability P(A), assuming independence, and the probability of observing A at least m(XA) times out of m(X) is

p_bin1 = Σ_{i=m(X,A)}^{m(X)} C(m(X), i) P(A)^i (1 - P(A))^{m(X)-i}.

The probability can be estimated by the normal distribution, when we calculate the z-score

z(X → A) = √n (P(XA) - P(X)P(A)) / √(P(X)P(A)(1 - P(A))).

This is closely related to the χ²-score, which is χ²(X → A) = z²(X → A) + z²(¬X → A). The problem with this measure is that rules in the same data set cannot be compared: each rule is tested in a different part of the data, where the antecedent X holds. According to our experiments, the results are very poor if we use one absolute minimum threshold for all rules. As a solution, we can try to normalize the measure and divide it by its maximum value √(nP(X)(1 - P(A))/P(A)). The result is (P(A|X) - P(A))/(1 - P(A)), which happens to be Shortliffe's certainty factor [10], when γ > 1. The certainty factor can produce very accurate results when used for prediction purposes (e.g. [4]). However, it is very sensitive to redundant rules, because all rules with confidence 1.0 gain the maximum score, even if they occur on just one row. In addition, the certainty factor does not suit branch-and-bound style search, because its upper bound is always the same.
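
To illustrate the quantities of Section 3.1, here is a minimal Python sketch (not from the paper; the helper name and toy counts are invented) that evaluates the exact binomial tail p_bin1, its normal approximation (the z-score above), and the normalized score, i.e. the certainty factor.

```python
from math import comb, sqrt

def binom1(n, m_X, m_A, m_XA):
    """Exact tail p_bin1, its z-score approximation, and the certainty factor."""
    P_A, P_X, P_XA = m_A / n, m_X / n, m_XA / n
    # probability of observing A at least m(XA) times on the m(X) rows where X holds
    p_bin1 = sum(comb(m_X, i) * P_A**i * (1 - P_A)**(m_X - i)
                 for i in range(m_XA, m_X + 1))
    z = sqrt(n) * (P_XA - P_X * P_A) / sqrt(P_X * P_A * (1 - P_A))
    cf = (P_XA / P_X - P_A) / (1 - P_A)        # z divided by its maximum value
    return p_bin1, z, cf

print(binom1(n=100, m_X=10, m_A=20, m_XA=8))   # small p, large z, cf = 0.75
```
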
3.2 Binomial probability, version 2

There is another, better way to use the binomial probability for assessing association rules. In this approach, we consider the whole data set and assume that on each row the combination XA has probability P(X)P(A) to occur, if X and A are independent. Now we assume that P(X)P(A) is correct under the independence assumption, but no assumptions are made on P(X) and P(A) themselves. Once again, the frequency of XA follows a binomial distribution:

p_bin2 = Σ_{i=m(X,A)}^{n} C(n, i) (P(X)P(A))^i (1 - P(X)P(A))^{n-i}.

The corresponding z-score is

z(X → A) = √n (P(XA) - P(X)P(A)) / √(P(X)P(A)(1 - P(X)P(A))).

Now all rules in one data set are comparable, and we can search all rules at a desired significance level (minimum z value). According to our experiments ([5]), the results are very good compared to other measures. However, the z-score overestimates the significance when the frequencies become low, which can both lead to suboptimal results and lengthen the search. As a result, we have used upper and lower bounds of the exact binomial probability to cut the search in time. Unfortunately, the binomial probability itself is not a monotonic function of P(A) and does not suit our search algorithm as such.

3.3 Fisher's exact test

A third alternative, used in Fisher's exact test, is to calculate the probability of the whole observed distribution of X and A. This approach is also taken in [11] for assessing the significance of association rules. When we estimate the significance of positive correlation between X and A (or, equivalently, ¬X and ¬A), the probability of the observed or a stronger correlation is

p(X → A) = Σ_{i=0}^{m} [ m(X)! m(¬X)! m(A)! m(¬A)! ] / [ n! (m(X,A)+i)! (m(X,¬A)-i)! (m(¬X,A)-i)! (m(¬X,¬A)+i)! ],

where m = min{m(X,¬A), m(¬X,A)}. This is the same as the significance of negative correlation between X and ¬A, or between ¬X and A. Thus, we can easily search both positive and negative correlations with the same measure, by testing the significance of positive correlation between X and A and between X and ¬A. The factorials in the equation can be problematic (causing an overflow) when the data set is any larger, but for the search algorithm it is sufficient to have good upper and lower bounds. In addition, we can take the logarithm of p, which is often easier to calculate.

3.4 Lower bound for the Fisher probability

A lower bound for p(X → A) is achieved by considering the best possible rule which can be derived from X. A tight lower bound can be derived from the Stirling inequality

n^n √(2πn) e^{Q1} e^{-n} < n! < n^n √(2πn) e^{Q2} e^{-n},

where Q1 = (12n + 1)^{-1} and Q2 = (12n)^{-1}. Let A be the minimum attribute which can be added to X, i.e. P(A) ≤ P(B = b) or P(¬A) ≤ P(B = b) for all B ∈ Y, b ∈ {0, 1}, and for all Y ⊋ X. The best p-value is achieved when both the frequency and the lift are maximal, i.e. m(X, A) = m(X) and γ = min{P(A), P(¬A)}^{-1}. There are two cases:

1. If m(X) = m(A),

   p = m(A)! m(¬A)! / n! ≥ P(A)^{m(A)+0.5} P(¬A)^{m(¬A)+0.5} √(2πn) e^Q,

   where e^Q > 1. Therefore, we can omit e^Q.

2. If m(X) < m(A),

   p = m(A)! m(¬X)! / (n! (m(A) - m(X))!) ≥ P(¬X)^{m(¬X)+0.5} P(A)^{m(A)+0.5} / (P(A) - P(X))^{m(A)-m(X)+0.5} · e^Q.

   Once again e^Q > 1.

The first lower bound is an increasing function of P(A), when P(A) < 0.5. Therefore, it can be used to derive a minimum frequency threshold: min_fr = max{P(A) | p(Bestrule(A)) > max_p}.

3.5 Upper bound for the Fisher probability

An upper bound for p is achieved by multiplying the upper bound of the first term in Fisher's sum by the number of terms. In the best case, when P(XA) = P(X) or P(XA) = P(A), there is just one term. Otherwise, the number of terms is tnum = min{m(X¬A), m(¬XA)} + 1. The first term can be estimated quite accurately using Gosper's approximation

n! ≈ n^n √((6n + 1)π/3) e^{-n}.

Now we can derive the following upper bounds:

1. When m(XA) < m(X), m(XA) < m(A), and m(¬X¬A) > 0,

   p(X → A) ≤ tnum · (P(X)P(A)/P(XA))^{m(XA)} (P(¬X)P(A)/P(¬XA))^{m(¬XA)} (P(X)P(¬A)/P(X¬A))^{m(X¬A)} (P(¬X)P(¬A)/P(¬X¬A))^{m(¬X¬A)}
        · √[ 3(6m(X)+1)(6m(¬X)+1)(6m(A)+1)(6m(¬A)+1) / (π(6n+1)(6m(XA)+1)(6m(X¬A)+1)(6m(¬XA)+1)(6m(¬X¬A)+1)) ].

2. When m(XA) < m(X), m(XA) < m(A), and m(¬X¬A) = 0, we know that m(X¬A) = m(¬A) and m(¬XA) = m(¬X), and

   p(X → A) ≤ (P(X)P(A)/P(XA))^{m(XA)} P(X)^{m(¬A)} P(A)^{m(¬X)}.

3. When m(XA) = m(X) = m(A),

   p(X → A) ≤ P(A)^{m(A)} P(¬A)^{m(¬A)} √[ (6m(A)+1)(6m(¬A)+1)π / (3(6n+1)) ].

4. When m(XA) = m(X) < m(A),

   p(X → A) ≤ P(A)^{m(A)} P(¬X)^{m(¬X)} / (P(A) - P(X))^{m(A)-m(X)} · √[ (6m(¬X)+1)(6m(A)+1) / ((6m(A)-6m(X)+1)(6n+1)) ].

In practice, we can split the computation into several parts by taking the logarithm of p. This also helps to avoid overflow.
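
For reference, the exact Fisher p-value of Section 3.3 can also be computed in log space with lgamma, which avoids the factorial overflow mentioned above. The sketch below is not from the paper; it computes ln p directly rather than the bounds of Sections 3.4 and 3.5.

```python
from math import lgamma, log, exp

def lnfact(k):
    return lgamma(k + 1)          # ln(k!)

def fisher_ln_p(m_XA, m_XnA, m_nXA, m_nXnA):
    """ln of the one-sided Fisher p-value for positive dependence of X and A,
    summed over all tables with at least the observed m(XA)."""
    n = m_XA + m_XnA + m_nXA + m_nXnA
    const = (lnfact(m_XA + m_XnA) + lnfact(m_nXA + m_nXnA)
             + lnfact(m_XA + m_nXA) + lnfact(m_XnA + m_nXnA) - lnfact(n))
    terms = [const - lnfact(m_XA + i) - lnfact(m_XnA - i)
                   - lnfact(m_nXA - i) - lnfact(m_nXnA + i)
             for i in range(min(m_XnA, m_nXA) + 1)]
    t_max = max(terms)            # log-sum-exp for numerical stability
    return t_max + log(sum(exp(t - t_max) for t in terms))

# contingency counts m(XA), m(X not-A), m(not-X A), m(not-X not-A)
print(fisher_ln_p(30, 10, 20, 40))
```

On small tables the result can be cross-checked against a library routine such as scipy.stats.fisher_exact with alternative='greater', if SciPy is available.
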
4 Algorithm

FishApriori implements a search algorithm for association rules containing both positive and negative correlations between positive attribute values. The main principle (a branch-and-bound search with a special enumeration tree) has been adapted from the StatApriori algorithm [5], which searches only significant positive correlations between positive attribute values. For FishApriori, we have developed a new strategy for handling negative correlations. In addition, we have improved the overall efficiency with a new, effective technique for frequency counting.

4.1 The main principles

The main pruning principle is based on the following observation: if measure I is an increasing function of the lift γ(X, A), its upper bound is a decreasing function of P(A), and vice versa, when the frequency P(XA) is fixed. This holds for several common measure functions, as stated in the classical axioms by Piatetsky-Shapiro [9]. In the following, we will assume that the measure function increases with significance, although the probability behaves in the opposite direction. (One can imagine that the actual measure function is the reverse, p^{-1}.) Now the significance is a monotonic property as long as the sets contain the same minimum attribute, Mina(X) = arg min{P(A) | A ∈ X}.

Theorem 4.1. Let f_M(fr, γ) be an increasing function of fr and γ, and let M(X → A) = f_M(P(XA), γ(X, A)). Then for any attribute sets X ⊊ Y we have (i) UB(Bestrule(X)) = f_M(P(X), P(Mina(X))^{-1}), and (ii) if Mina(X) = Mina(Y), then UB(Bestrule(Y)) = f_M(P(X), P(Mina(X))^{-1}).

Proof. Follows from the observation γ(X → A) ≤ P(A)^{-1} ≤ (min{P(A) | A ∈ X})^{-1}.

Now it is enough to generate the attribute sets in such an order that the children sets (more special sets) have the same minimum attribute as their parents (more general sets). If the sets are generated in a correct order, each (l + 1)-set has at most one parent set with a different minimum attribute. In practice this means that the property PS, "potentially significant", is monotonic in all other parent relations. This can be achieved when the new attribute sets are generated according to an ordered enumeration tree.

[Figure 1: A complete enumeration tree (dashed lines) and an example tree (solid lines).]

Figure 1 shows an example of an ordered enumeration tree, when R = {A, B, C, D, E} and P(A) ≤ P(B) ≤ ... ≤ P(E). Solid lines represent an example tree, when two levels are generated, and dashed lines show the nodes missing from the complete tree. The missing nodes at level 2 are non-PS, and it is enough to generate new children sets from the existing PS nodes.

Negative values in the consequents and the new frequency counting method complicate the algorithm, but the same idea can still be applied. Now the attributes are arranged in ascending order by min{P(A), P(¬A)}. The nodes are classified into three categories according to their significance:

1. Set X is PS, potentially significant, if significant rules can be derived from X. This is true, if UB(I(Bestrule(X))) = i(P(X), P(Mina(X))^{-1}) ≥ min_I.

2. Otherwise set X is non-PS. Such a set can still be a parent of a PS set in the previously mentioned special case, but no children are generated from it.

   (a) Set X is absolutely non-PS, if none of its children can be significant, i.e. i(P(X), minp^{-1}) < min_I, where minp = min{P(A), P(¬A) | A ∈ R}. Such nodes can be removed from the tree immediately.

   (b) Set X is non-PS but not absolutely non-PS, if i(P(X), P(Mina(X))^{-1}) < min_I but i(P(X), minp^{-1}) ≥ min_I. This means that X can be a parent of a PS set, and it is saved until the next level is generated.

4.2 Frequency counting

Frequency counting is the main bottleneck of association rule discovery algorithms. It is especially time consuming in breadth-first search (which could otherwise be faster), because all l-candidates are stored and checked at once. This is also space consuming, which becomes the final burden when there is no more memory left. In depth-first search, checking is usually done in smaller batches, even one candidate at a time, using some additional data structure. Constructing and storing the data structure takes time and space, too, but when the data set is dense, the strategy is usually more efficient.

Our new approach combines the benefits of both strategies. The attribute sets are generated in a breadth-first manner, but the frequencies are checked immediately. If a candidate is useless, it can be deleted immediately. Normally, it would be hopeless to check candidates one by one from the data, but a new data structure changes the situation. The data is stored into a reversed bitmap, where rows and columns are reversed: instead of n rows of k-bit vectors we have k rows of n-bit vectors. Each attribute A is implemented as a bit vector which has bit 1 in the ith position, if A occurs on the ith row. The trick is that now we can count any frequencies by simple bit-and operations. The frequency of set X is the number of 1-bits in the bit-and of the vectors of all A ∈ X. Counting the frequencies of all l-candidates in collection C_l takes at most |C_l| · l · n time instead of |C_l| · k · n time. In practice, there are several ways to implement the bit-and operation and the bit counting more efficiently.
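
The following Python sketch (not from the paper; names and toy data are illustrative) mimics the reversed bitmap with arbitrary-precision integers as bit vectors, so that the frequency of any attribute set is obtained by AND-ing vectors and counting 1-bits.

```python
def build_bitmap(rows, attributes):
    """One n-bit vector per attribute; bit i is set iff the attribute occurs on row i."""
    bitmap = {a: 0 for a in attributes}
    for i, row in enumerate(rows):
        for a in row:
            bitmap[a] |= 1 << i
    return bitmap

def frequency(itemset, bitmap):
    """m(X): bitwise-AND of the attribute vectors, then count the 1-bits."""
    vectors = [bitmap[a] for a in itemset]
    v = vectors[0]
    for w in vectors[1:]:
        v &= w
    return bin(v).count("1")      # int.bit_count() on Python 3.10+

rows = [{"A", "B"}, {"A"}, {"A", "B", "C"}, {"B", "C"}]
bm = build_bitmap(rows, {"A", "B", "C"})
print(frequency({"A", "B"}, bm))  # 2
```
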

4.3 Pseudocode

The pseudocode of the FishApriori algorithm is given in Algorithms 4.1 and 4.2.

Algorithm 4.1. FishApriori(R, r, max_p)
Input: set of attributes R, data set r, threshold max_p
Output: minimal, significant rules
Method:
 1  determine min_fr from max_p
 2  for all A ∈ R: if (P(A) < min_fr) R = R \ {A}
 3  k = |R|; l = 1
 4  arrange the attributes A ∈ R such that (P(A_i) ≤ min{P(A_{i+1}), P(¬A_{i+1})}) or (min_fr ≤ P(¬A_i) ≤ min{P(A_{i+1}), P(¬A_{i+1})})
 5  for i = 0 to k - 1: add A_i to C_l
 6  while (|C_l| ≥ l + 1)
 7    for all X_i, X_j ∈ C_l
 8      Y = GenCand(X_i, X_j)
        // is Y absolutely non-PS or redundant?
 9      if ((LB(p(Y, A_min)) ≤ max_p) and (LB(p(Y, A_min)) < max{Z.p | Z ⊊ Y}))
10        C_{l+1} = C_{l+1} ∪ {Y}
11        count P(Y)
          // is Y non-PS or redundant?
12        if (LB(p(Y, Mina(Y))) > max_p)
13          Y.PS = 0
14        if (LB(p(Y, Mina(Y))) ≥ max{Z.p | Z ⊊ Y})
15          Y.Red = 1; Y.p = max{Z.p | Z ⊊ Y}
16        else calculate Y.p
17        mark whether Y is minimal
18  output minimal, significant rules

Algorithm 4.2. GenCand(X_1, X_2)
Input: potentially significant l-sets X_1 and X_2
Output: (l + 1)-candidate Y
Method:
 1  if (PS(X_1) and (¬Minimal(X_1)) and ((PS(X_2)) or (l == 2)) and (¬Minimal(X_2)))
 2    Y = X_1 ∪ X_2
 3    if for all l-subsets Z ⊊ Y: ((Z ∈ C_l) and ((PS(Z)) or (Mina(Z) ≠ Mina(Y))) and (¬Minimal(Z)))
 4      return Y
 5  else return NULL

4.4 Example

Let us take an example. An example data set is given in Table 2. The order of the attributes is defined by P(A) = 0.05 < P(B) = 0.10 = P( ) < P(C) = 0.20 < P( ) = 0.50. This means that in the first branch, under label A, the maximum possible lift is P(A)^{-1} = 20, in the second branch P(B)^{-1} = 10, in the third branch P( )^{-1} = 10, etc. The logarithm of the maximum p is -1. On the second level, all sets except C were potentially significant. Figure 2 shows the enumeration tree when the third level is beginning. Solid-line nodes have been added to the tree and the frequencies are given in the nodes.

Table 2: Example data.
m(ABC ) = 2    m(AB C) = 1    m(ABC ) = 1     m(A B C) = 1
m( AB C) = 1   m( ABC ) = 3   m( ABC ) = 2    m( A BC ) = 10
m( A BC ) = 2  m( A B C) = 42 m( A B C ) = 5  m( A B C) = 30

[Figure 2: An example enumeration tree, when the third level is processed.]

First, the algorithm generates candidate AB from AB and A. The third parent, B, is also in the tree, which means that the candidate can be PS. The frequency of AB is counted and the best rule (the one with the highest lift) is generated. The rule is B → A. The lower bound of its ln(p) is -10.1. It is below the parent's ln(p), -7.6, and the rule is potentially non-redundant. The second set, ABC, is also PS. However, its best rule, C → A, has a higher ln(p) (-7.4) than its parent BC (-9.6), and the set is marked as redundant. The parent's ln(p) is copied to the node. The next candidates AB and AC are PS, but not significant. Set A produces a non-redundant, significant rule with ln(p) = -2.1. Set AC is not generated, because one of its parents, C, is absolutely non-PS (and thus missing from the tree). The next set is BC and its best rule has lower bound ln(p) = -7.4. However, this does not improve on BC's ln(p), and the rule is redundant.

B is PS but not significant. The last sets, BC and B, cannot be significant because their parent sets C and C are absolutely non-PS. On the fourth level, no more rules are generated. The program outputs the five non-redundant, significant rules given in Table 3. For comparison, we have also calculated the exact ln(p) values, in addition to the estimated upper bounds. With larger data sets the error is much smaller.

Table 3: Rules derived from the example data. The last column gives the actual ln(p); ln(UB(p)) is the estimated upper bound.

rule    fr     cf     γ     ln(UB(p))   ln(p)
B C     0.08   0.80   4.0     -9.6      -10.6
A B     0.04   0.80   8.0     -7.6       -8.4
A       0.02   1.00   1.1     -2.1       -4.7
C       0.05   0.25   2.5     -2.0       -3.7
A C     0.03   0.60   3.0     -1.9       -2.9

5 Experiments

FishApriori was tested with several large data sets and the results were compared to the traditional Apriori, using the χ²-measure in the post-processing phase. The Apriori program was an efficient implementation with a prefix tree by C. Borgelt (FIMI repository, http://fimi.cs.helsinki.fi/src/). It produced only positive association rules and therefore we selected only positive correlations with the highest χ²-values. Redundancy reduction and rule ranking were implemented in a separate program.

The data sets and test parameters are described in Table 4. All data sets were tested with two minimum confidence thresholds, 0.6 and 0.9. The goal was to test both the prediction accuracy (accuracy of strong rules) and the robustness (genuineness of correlations). In the first case, the natural measure is the average test error. In the latter case, the prediction error can be large (if the method is robust, the expected error is 0.4), but the correlations should be at least as strong in the test set as in the learning set.

To measure the robustness, we calculated the z-score for the difference of the lift distribution (characterized by the mean and variance) between the learning set and the test set:

z(Δγ) = (μ(γ_L) - μ(γ_T)) / √(σ²(γ_L)/n_L + σ²(γ_T)/n_T),

where γ_L is the lift variable in the learning set and γ_T in the test set, μ and σ² denote the mean and variance, n_L is the number of rules considered in the learning set, and n_T is the number of these rules which could be applied in the test set. The closer the z-score is to zero, the more similar the distributions are. If the score is negative, then the correlations are stronger in the test set than in the learning set. For example, Figures 3 and 4 show the lift distribution in a learning and a test set of data set T40I10100K. The distributions are very similar, which hints that the same correlations hold in both data sets. The z-score was also very small, -0.1.

[Figure 3: The lift distribution among the 100 best rules in a learning set of T40I10100K.]

[Figure 4: The lift distribution among the 100 best rules in a test set of T40I10100K.]

The first four data sets are classical benchmark data sets from the FIMI repository for frequent item set mining (http://fimi.cs.helsinki.fi/). Plants lists all plant species growing in the U.S.A. and Canada. Each row contains a species and its distribution (the states where it grows). The data has been extracted from the USA plants database (http://plants.usda.gov/index.html). Garden lists recommended plant combinations. The data is extracted from several gardening sources (e.g. http://baygardens.tripod.com/).

All experiments were executed on an Intel Core Duo processor T5500 1.66 GHz with 1 GB RAM, 2 MB cache, and the Linux operating system. The average execution times are given in Table 4. The quality of the rules is summarized in Table 5. For brevity, we report only the 100 best rules. For each test we give the average lift, the z-score of the difference of the lift between the learning and test sets, the prediction error in the test set, the average frequency, confidence, and rule length (number of attributes in the antecedent), and, for FishApriori, the number of rules with a negative consequent. All tests were repeated ten times with different learning and test sets, each time with 2/3 of the data in the learning set.

Both measures behaved quite robustly, as expected, since both of them have been designed for measuring statistical dependencies. In previous research we have found that this is not necessarily the case with other measure functions. With Mushroom, FishApriori produced a clearly smaller prediction error. The lift was also slightly higher. Mushroom is generally quite an easy set in which to find accurate rules. Chess was the only set where FishApriori found negative rules. When the confidence threshold was 0.6, it found 27 negative rules, but with confidence 0.9, just one. In the first case the prediction accuracy was quite poor. In the latter case the prediction accuracy was already good, slightly better than with the traditional Apriori. We note that the traditional Apriori required an extremely high minimum frequency threshold, 0.75, to be feasible. Partly this was due to the extra post-processing (redundancy reduction and ranking), but even without post-processing, Apriori could not be run with min_fr < 0.5. Another interesting observation is the lift, which was 1.0 for Apriori. In fact, the strongest and most frequent rules in Chess hold between independent events. Therefore, the correlation measures tend to select infrequent and weaker rules.

T10I4100K and T40I10100K are quite similar sets, reminiscent of real market-basket data. In T40I10100K, the transactions are larger and the computation is heavier. In T10I4100K, the traditional Apriori with χ² produced a smaller error with the lower confidence, but with the higher confidence the errors were the same. With T40I10100K, the traditional Apriori performed clearly better. The prediction errors were smaller and the lift was also higher. Once again, FishApriori had problems finding enough significant rules, and the significance threshold had to be loosened.

With Plants and Garden, FishApriori outperformed Apriori. The prediction errors were remarkably smaller, especially with Garden. Garden is an especially difficult set, because all patterns (plant combinations involving certain varieties, colours, leaf types, etc.) are very rare. Still, FishApriori was able to select the genuine patterns among all the rare combinations. The traditional Apriori produced a clearly higher lift. This is quite natural, because the lift favours rare items in the consequent, and the resulting rules can be too infrequent. With these data sets, FishApriori tested several potentially significant negative rules, but none of them belonged to the 100 best rules.
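
As a concrete reading of the robustness score z(Δγ) defined earlier in this section, here is a minimal sketch (not from the paper; the sample lift values are invented):

```python
from math import sqrt
from statistics import mean, variance

def lift_shift_z(lifts_learning, lifts_test):
    """z-score of the difference between the lift distributions of the
    learning and test sets; values near zero indicate similar distributions."""
    mu_L, mu_T = mean(lifts_learning), mean(lifts_test)
    var_L, var_T = variance(lifts_learning), variance(lifts_test)
    n_L, n_T = len(lifts_learning), len(lifts_test)
    return (mu_L - mu_T) / sqrt(var_L / n_L + var_T / n_T)

print(lift_shift_z([2.0, 2.4, 3.1, 2.8], [2.1, 2.5, 2.9, 3.0]))
```
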
6 Conclusions

We have considered the problem of discovering both positive and negative association rules efficiently. At the same time, the main objective has been the quality of the discovered rules. As a solution, we have applied Fisher's exact test and pruned redundant rules on-line. The resulting FishApriori algorithm extends our previous algorithm, but several new results were needed before the main idea could be applied to negative rules. In addition, we had to derive computationally feasible upper and lower bounds for Fisher's p-value.

The algorithm was tested carefully with several large data sets. The results were really encouraging. With most data sets, including computationally difficult dense data sets, the execution time was less than a second. This is especially remarkable, because no minimum frequency thresholds were used and the p-value threshold was set high enough that at least 100 rules could be found. The efficiency was partly due to a new frequency counting method, which can be used to accelerate other algorithms as well. The rule quality was also very good, both in terms of prediction accuracy and robustness of correlations. However, significant negative correlations were relatively rare in the tested data sets.

In future research, our next goal is to tackle general association rules which can contain negations anywhere in the antecedent or consequent. The bit operations in the frequency counting could also be accelerated further. The Garden data set motivates developing techniques for searching significant rules among different levels of attribute hierarchies.

References

[1] R. Agrawal, T. Imielinski, and A.N. Swami. Mining association rules between sets of items in large databases. In P. Buneman and S. Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207-216, Washington, D.C., 26-28 1993.

[2] M.-L. Antonie and O.R. Zaïane. Mining positive and negative association rules: an approach for confined rules. In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'04), pages 27-38, New York, NY, USA, 2004. Springer-Verlag New York, Inc.

[3] Y. Bastide, N. Pasquier, R. Taouil, G. Stumme, and L. Lakhal. Mining minimal non-redundant association rules using frequent closed itemsets. In Proceedings of the First International Conference on Computational Logic (CL'00), volume 1861 of Lecture Notes in Computer Science, pages 972-986. Springer-Verlag, 2000.

[4] F. Berzal, I. Blanco, D. Sánchez, and M. Amparo Vila Miranda. A new framework to assess association rules. In Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis (IDA'01), pages 95-104. Springer-Verlag, 2001.

[5] W. Hämäläinen and M. Nykänen. Efficient discovery of statistically significant association rules. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008), 2008. To appear.

[6] Y.S. Koh. Mining non-coincidental rules without a user defined support threshold. In Advances in Knowledge Discovery and Data Mining, Proceedings of the 12th Pacific-Asia Conference (PAKDD 2008), volume 5012 of Lecture Notes in Computer Science, pages 910-915. Springer, 2008.

[7] Y.S. Koh and R. Pears. Efficiently finding negative association rules without support threshold. In AI 2007: Advances in Artificial Intelligence, Proceedings of the 20th Australian Joint Conference on Artificial Intelligence (AI 2007), volume 4830 of Lecture Notes in Computer Science, pages 710-714. Springer, 2007.

[8] R. Meo. Theory of dependence values. ACM Transactions on Database Systems, 25(3):380-406, 2000.

[9] G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro and W.J. Frawley, editors, Knowledge Discovery in Databases, pages 229-248. AAAI/MIT Press, 1991.

[10] E.H. Shortliffe and B.G. Buchanan. A model of inexact reasoning in medicine. Mathematical Biosciences, 23:351-379, 1975.

[11] G.I. Webb. Discovering significant patterns. Machine Learning, 68(1):1-33, 2007.

[12] G.I. Webb. Discovering significant rules. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), pages 434-443, New York, USA, 2006. ACM Press.

[13] X. Wu, C. Zhang, and S. Zhang. Efficient mining of both positive and negative association rules. ACM Transactions on Information Systems, 22(3):381-405, 2004.

Table 1: Basic notations.

Notation                                        Meaning
A, B, C, ...                                    binary attributes
a, b, c, ... ∈ {0, 1}                           attribute values
R = {A_1, ..., A_k}                             set of all attributes
|R| = k                                         number of attributes in R
Dom(R) = {0, 1}^k                               attribute space
X, Y, Z ⊆ R                                     attribute sets
Dom(X) = {0, 1}^l ⊆ Dom(R)                      domain of X, |X| = l
(X = x) = {(A_1 = a_1), ..., (A_l = a_l)}       event, |X| = l
t = {A_1 = t(A_1), ..., A_k = t(A_k)}           row
r = {t_1, ..., t_n | t_i ∈ Dom(R)}              relation (data set)
|r| = n                                         size of relation r
σ_{X=x}(r) = {t ∈ r | t[X] = x}                 set of rows where X = x
m(X = x) = |σ_{X=x}(r)|                         absolute frequency of X = x
P(X = x) = m(X = x)/n                           relative frequency of X = x
(X = x) → (A = a)                               association rule
X → ¬A                                          association rule containing negations only in the consequent
fr(X → A) = P(XA)                               frequency of the rule
cf(X → A) = P(A | X) = P(XA)/P(X)               confidence of the rule
γ(X, A) = P(XA)/(P(X)P(A))                      lift of the rule
UB(M)                                           an upper bound for M
LB(M)                                           a lower bound for M
Bestrule(X) = arg max_{A ∈ X} {M(X \ A → A)}    the best rule which can be derived from X
Mina(X) = arg min{P(A) | A ∈ X}                 the least frequent attribute in X
PS(X)                                           is X potentially significant? PS(X) = 1, if UB(M(Bestrule(X))) ≥ min_M
Red(X)                                          is X redundant? Red(X) = 1, if ∃ Z ⊊ X: M(Bestrule(Z)) ≥ M(Bestrule(X))
Minimal(X)                                      is X minimal? Minimal(X) = 1, if ∀ Y ⊋ X: M(Bestrule(Y)) ≤ M(Bestrule(X))
Significant(X)                                  is X significant? Significant(X) = 1, if M(Bestrule(X)) ≥ min_M

Table 4: Data sets, test parameters, and average execution times in seconds. The natural logarithm of the maximum p-value, max_lnp, was used in FishApriori and the minimum frequency threshold, min_fr, in the traditional Apriori. For Apriori, search and post-processing times are given separately.

                                       FishApriori          TradApriori
     data          n       k   min_cf  max_lnp  time     min_fr  time
1a   Mushroom      8124   120    0.6    -800      0       0.24    0+40
1b   Mushroom      8124   120    0.9    -800      0       0.22    1+440
2a   Chess         3196    75    0.6     -60      0       0.75   <1+110
2b   Chess         3196    75    0.9     -50      0       0.75   <1+600
3a   T10I4100K   100000  1000    0.6   -1500      8       0.001   1+180
3b   T10I4100K   100000  1000    0.9   -1500      8       0.001   1+150
4a   T40I10100K  100000  1000    0.6   -2000     47       0.01    6+160
4b   T40I10100K  100000  1000    0.9   -2000     47       0.01    6+100
5a   Plants      226632    70    0.6    -100      0       0.12   45+4
5b   Plants       22632    70    0.9    -100      0       0.12   42+4
6a   Garden        2235  2372    0.6     -30      0       0.001   0
6b   Garden        2235  2372    0.9     -30      0       0.001   0

Table 5: Quality of the 100 best rules found by FishApriori and the traditional Apriori. The parameters are: average lift, z-score of the difference in lift, average prediction error, frequency, confidence, rule length (number of attributes in the antecedent), and, for FishApriori, the number of negative rules.

             FishApriori (ln(p))                         Apriori (min_fr + χ² test)
      lift   z(Δγ)  err   fr      cf    len  #neg     lift   z(Δγ)  err   fr      cf    len
1a     3.6    0.1   0.07  0.2076  0.93  2.0    0       2.3    0.0   0.18  0.3009  0.83  3.0
1b     4.5    0.1   0.02  0.1847  0.98  2.5    0       3.1   -0.3   0.04  0.2550  0.96  3.1
2a     2.2   -0.0   0.30  0.2689  0.70  1.0   27       1.0    0.1   0.09  0.8349  0.91  2.2
2b     1.4   -0.1   0.04  0.2975  0.86  0.9    2       1.0    0.1   0.05  0.8387  0.95  2.1
3a    96.2   -0.2   0.13  0.0060  0.88  1.9    0     736.4   -0.5   0.08  0.0011  0.92  2.7
3b    82.9   -0.2   0.04  0.005   0.94  1.3    0     736.0   -0.2   0.04  0.0012  0.96  2.8
4a    38.2   -0.1   0.18  0.0136  0.82  2.0    0      68.9    0.2   0.05  0.0104  0.96  3.0
4b    39.7    0.0   0.08  0.0112  0.92  2.2    0      60.7    4.3   0.02  0.0089  0.84  3.2
5a     4.0    0.1   0.14  0.1618  0.86  1.0    0       2.1    0.1   0.22  0.1289  0.79  2.1
5b     4.8    0.4   0.07  0.1126  0.93  1.6    0       5.3    0.2   0.09  0.1264  0.92  3.1
6a   124.1    2.7   0.20  0.0005  0.33  0.5    0     573.1    1.3   0.65  0.0014  0.92  1.7
6b   124.9    2.8   0.19  0.0005  0.33  0.5    0     589.4    0.8   0.40  0.0014  1.00  1.7