Efficient discovery of statistically significant association rules

Efficient discovery of statistically significant association rules

Wilhelmiina Hämäläinen, Department of Computer Science, University of Helsinki, Finland
Matti Nykänen, Department of Computer Science, University of Kuopio, Finland

Abstract

Searching for statistically significant association rules is an important but neglected problem. Traditional association rules do not capture the idea of statistical dependence, and the resulting rules can be spurious while the most significant rules may be missed. This leads to erroneous models and predictions, which often become expensive. The problem is computationally very difficult, because significance is not a monotonic property. However, in this paper we prove several other properties which can be used for pruning the search space. The properties are implemented in the StatApriori algorithm, which searches for statistically significant, non-redundant association rules. Based on both theoretical and empirical observations, the resulting rules are very accurate compared to traditional association rules. In addition, StatApriori can work with extremely low frequencies, thus finding new interesting rules.

1. Introduction

Traditional association rules [1] are rules of the form "if event X = x occurs, then event A = a is also likely to occur". The commonness of the rule is measured by its frequency P(X = x, A = a) and the strength of the rule by its confidence P(A = a | X = x). For computational purposes it is required that both the frequency and the confidence exceed some user-defined thresholds. The actual interestingness of the rule is usually decided afterwards, by some interestingness measure.

Often the associations are interpreted as correlations or dependencies between certain attribute-value combinations. However, traditional association rules do not necessarily capture statistical dependencies: they can associate absolutely independent events while ignoring strong dependencies. As a solution, it is often suggested (following the axioms by Piatetsky-Shapiro [18]) to measure the lift (interest) instead of the confidence (e.g. [21]). This produces statistically sounder results, but it is still possible to find spurious rules while missing statistically significant rules. These two error types are called type 1 and type 2 errors (in computer science terms, false positives and false negatives). In the worst case, all discovered rules can be spurious [23, 24]. In practice, this means that the future data does not exhibit the discovered dependencies, and conclusions based on them are erroneous. The results can be expensive or even fatal, as the following example demonstrates.

Example 1. A biological database contains observation reports from different kinds of biotopes, like grove, marsh, waterside, coniferous forest, etc. For association analysis, each report is represented as a binary vector listing the observed species along with biotope characteristics. Local forestry societies as well as individual land owners can use the data when they decide e.g. on fellings or protected sites. The forestry society FallAll is going to drain swamps for new forests. Before any decisions are made, they search for associations in the 1000 observations on marsh sides, using a minimum frequency of 0.05 and a minimum confidence threshold. One discovered rule is leather leaf ⇒ cloudberry, with frequency 0.06 and a confidence above the threshold. Since the cloudberry is a commercially important product, the forestry society decides to protect a marsh growing leather leaves when the other swamps are drained.
The decision is excellent for the leather leaf, but all cloudberries in the area disappear. The reason is that cloudberries require a wet swamp, while leather leaves can grow on both moist and wet sides. The only protected swamp in the area was too dry for cloudberries. This catastrophe was due to the spurious rule leather leaf ⇒ cloudberry. The rule has p-value 0.13, which means that there is a 13% probability of making a type 1 error. At the same time, the forestry society misses an important rule, namely wet swamp, leather leaf ⇒ cloudberry. This rule was not found, because its frequency was too low. However, it is a strong rule with confidence 1.0, and its p-value indicates that the rule is quite reliable: roughly speaking, there is only a 1.1% probability that the rule is spurious.
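To make the contrast concrete, the following small Python sketch (with purely hypothetical counts, not the survey data of the example) computes the frequency, confidence, and lift of a rule from absolute counts; it shows that a rule can reach high confidence even when its lift stays close to 1, i.e. when the events are nearly independent.

    # A minimal sketch with hypothetical counts: confidence alone cannot distinguish
    # a spurious rule from a real dependency, but lift compares the rule against the
    # independence assumption P(X)P(A).

    def rule_stats(n, m_x, m_a, m_xa):
        """Return (frequency, confidence, lift) of rule X -> A from absolute counts."""
        fr = m_xa / n                      # P(X, A)
        cf = m_xa / m_x                    # P(A | X)
        lift = (m_xa * n) / (m_x * m_a)    # P(X, A) / (P(X) P(A))
        return fr, cf, lift

    # Hypothetical counts: A is very common, so the rule looks strong even under independence.
    print(rule_stats(n=1000, m_x=70, m_a=900, m_xa=60))   # fairly high confidence, lift near 1
    print(rule_stats(n=1000, m_x=40, m_a=60, m_xa=40))    # lower frequency, but lift >> 1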

The problems of association rules, and especially of the frequency-confidence framework, are well known ([23, 24, 6, 16]), but there have still been only a few attempts to solve them. Quite likely the reason is purely practical: the problem has been considered computationally intractable. Statistical significance is not a monotonic property, and therefore it cannot be used for pruning the search space in the same manner as the frequency. However, when we search directly for statistically significant rules (instead of sets), we can utilize other properties for efficient pruning. More efficiency is achieved by searching only for minimal (non-redundant) statistically significant rules. Such rules are at least as good as the pruned rules, but simpler, and no information is lost. In practice, the simpler rules avoid overfitting and hold better in future data.

In this paper, we introduce a set of properties which can be used for searching for the minimal, statistically most significant association rules. The properties are implemented in the StatApriori algorithm. StatApriori guarantees that no significant rules are missed, while the number of spurious rule candidates generated during the execution is kept minimal. Compared to a modification of the classical Apriori algorithm which also produces all significant association rules, StatApriori is very efficient: it can tackle problems which are impossible to compute with the classical Apriori.

As far as we know, the algorithm is the first of its kind. Previous algorithms have been restricted to statistically significant classification rules using the χ²-measure (e.g. [15, 16, 4, 22, 17]). This is quite a different problem, because in classification both X ⇒ C and X ⇒ ¬C should be accurate. In other words, classification rules describe dependencies between attributes, while association rules describe dependencies between events. Webb [24] has done pioneering work in testing the statistical significance of association rules by Fisher's exact test. Fisher's exact test, like χ², is designed for measuring dependence between attributes, and some significant association rules can be missed (type 2 error). However, no new algorithms were introduced in these experiments, and the search proved to be infeasible on large data sets with the existing techniques. In addition, several interestingness measures (see e.g. [9] for an overview) have their origins in statistics, but they do not measure the statistical significance of association rules. As a result, they can cause both type 1 and type 2 errors.

The organization of the paper is the following: basic definitions are given in Section 2, the main principles of the search in Section 3, and the StatApriori algorithm in Section 4. Experimental results are reported in Section 5 and the final conclusions are drawn in Section 6.

2. Basic definitions

In the following we give the basic definitions of the association rule, statistical dependence, statistical significance, and redundancy. The notations are introduced in Table 1.

Table 1. Basic notations.

  Notation                                    Meaning
  A, B, C, ...                                binary attributes
  a, b, c, ... ∈ {0, 1}                       attribute values
  R = {A_1, ..., A_k}                         set of all attributes
  |R| = k                                     number of attributes
  Dom(R) = {0, 1}^k                           attribute space
  X, Y, Z ⊆ R                                 attribute sets
  Dom(X) = {0, 1}^l                           domain of X, |X| = l
  (X = x) = {(A_1 = a_1), ..., (A_l = a_l)}   event, |X| = l
  t = {A_1 = t(A_1), ..., A_k = t(A_k)}       row (tuple, transaction)
  r = {t_1, ..., t_n | t_i ∈ Dom(R)}          relation (data set)
  |r| = n                                     size of relation r
  σ_{X=x}(r) = {t ∈ r | t[X] = x}             set of rows where X = x
  m(X = x) = |σ_{X=x}(r)|                     number of rows where X = x
  P(X = x) = m(X = x)/n                       relative frequency of X = x
  i(fr, γ)                                    measure function
  I(X ⇒ A) = i(P(XA), γ(X, A))                measure value of rule X ⇒ A
  upperbound(f)                               an upper bound for function f
  bestrule(X) = arg max_{A ∈ X} I(X \ A ⇒ A)  the best rule which can be constructed from X
  PS(X)                                       property "potentially significant": whether significant rules can be derived from X or its supersets
  minattr(X) = arg min {P(A_i) | A_i ∈ X}     minimum attribute of X, i.e. the one with the lowest frequency
2.1. Association rules

Traditionally, association rules are defined in the frequency-confidence framework:

Definition 1 (Association rule). Let R be a set of binary attributes and r a relation according to R. Let X ⊆ R, A ∈ R \ X, x ∈ Dom(X), and a ∈ Dom(A). The confidence of rule (X = x) ⇒ (A = a) is

  cf(X = x ⇒ A = a) = P(X = x, A = a) / P(X = x) = P(A = a | X = x)

and the frequency (support) of the rule is

  fr(X = x ⇒ A = a) = P(X = x, A = a).

Given user-defined thresholds min_cf, min_fr ∈ [0, 1], rule (X = x) ⇒ (A = a) is an association rule in r, if

  (i) cf(X = x ⇒ A = a) ≥ min_cf, and
  (ii) fr(X = x ⇒ A = a) ≥ min_fr.
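As an illustration of Definition 1 and the notation of Table 1, the following sketch (assuming NumPy is available; this is not the authors' implementation) computes m(X), the frequency, and the confidence of a rule directly from a 0/1 data matrix with one column per attribute.

    import numpy as np

    def frequency_and_confidence(r, X, A):
        """fr(X -> A) = P(X=1, A=1) and cf(X -> A) = P(A=1 | X=1) from a 0/1 matrix r."""
        n = r.shape[0]
        rows_X = np.all(r[:, X] == 1, axis=1)          # sigma_{X=1}(r)
        rows_XA = rows_X & (r[:, A] == 1)
        m_X, m_XA = rows_X.sum(), rows_XA.sum()        # m(X=1) and m(X=1, A=1)
        return m_XA / n, m_XA / m_X

    # toy relation over attributes A1..A4 (columns 0..3)
    r = np.array([[1, 1, 0, 1],
                  [1, 1, 1, 1],
                  [0, 1, 1, 0],
                  [1, 0, 1, 1]])
    print(frequency_and_confidence(r, X=[0, 1], A=3))  # rule {A1, A2} -> A4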

The first condition requires that an association rule should be strong enough, and the second that it should be common enough. In this paper we call rules association rules even if no thresholds min_fr and min_cf are specified. Usually it is assumed that the rule contains only positive attribute values (A_i = 1). The rule can then be expressed simply by listing the attributes, e.g. A_1, A_3, A_5 ⇒ A_2.

2.2. Statistical dependence

Statistical dependence is usually defined through statistical independence (e.g. [20, 14]):

Definition 2 (Independence and dependence). Let X ⊆ R and A ⊆ R \ X be sets of binary attributes. Events X = x and A = a, x ∈ Dom(X), a ∈ Dom(A), are mutually independent, if P(X = x, A = a) = P(X = x)P(A = a). If the events are not independent, they are dependent.

The strength of the statistical dependence between (X = x) and (A = a) can be measured by the lift or interest:

  γ(X = x, A = a) = P(X = x, A = a) / (P(X = x)P(A = a)).

In the following, we concentrate on dependencies between events containing only positive attributes. The lift of rule X ⇒ A is denoted simply γ(X, A).

2.3. Statistical significance

In this work, we analyze the statistical significance of association rules in the classical (frequentist) framework. Bayesian significance testing offers an interesting alternative, but it is still little studied in this context. Both approaches produce asymptotically similar results (under some assumptions the test results are identical), although Bayesian testing is sensitive to the selected prior probabilities [3].

The idea of classical statistical significance tests (see e.g. [8, Ch. 26] or [12, Ch. 10.1]) is to estimate the probability of the observed or a rarer phenomenon under some null hypothesis. When the objective is to test the significance of the dependency between events X and A, the null hypothesis is the independence assumption P(X, A) = P(X)P(A). The task is to calculate the probability p that the observed or a higher frequency occurs in the data if the events were actually independent. If the estimated probability p is very small, then the observed dependency is said to be significant at level p.

The significance of the observed frequency m(X, A) can be estimated exactly by the binomial distribution. Each row in relation r, |r| = n, corresponds to an independent Bernoulli trial whose outcome is either 1 (XA occurs) or 0 (XA does not occur). All rows are mutually independent. Assuming the independence of attributes X and A, combination XA occurs on a row with probability P(X)P(A). Now the number of rows containing X, A is a binomial random variable M with parameters P(X)P(A) and n. The mean of M is μ_M = nP(X)P(A) and its variance is σ_M² = nP(X)P(A)(1 − P(X)P(A)). Probability P(M ≥ m(X, A)) gives the significance p:

  p = Σ_{i=m(X,A)}^{m(X)} C(n, i) (P(X)P(A))^i (1 − P(X)P(A))^{n−i}.

The significance can be used in two ways to prune association rules: either 1) we set the significance level (maximum p) and search for all rules with sufficiently low p, or 2) we use the p-values to search for the K most significant rules. Deciding the required significance level is a difficult problem, which we do not try to solve here. The problem is that the more rules we test, the more spurious rules are likely to pass the significance test. Webb [24] has suggested a solution in the context of association rule discovery, using the Bonferroni adjustment [19].
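The exact binomial tail defined above can be evaluated directly; the following sketch (assuming SciPy is available, with purely hypothetical counts) implements it.

    from scipy.stats import binom

    def binomial_p_value(n, p_x, p_a, m_x, m_xa):
        """p = sum_{i=m(X,A)}^{m(X)} C(n,i) (P(X)P(A))^i (1 - P(X)P(A))^(n-i)."""
        q = p_x * p_a                                  # success probability under independence
        return binom.cdf(m_x, n, q) - binom.cdf(m_xa - 1, n, q)

    # hypothetical figures: n = 1000 rows, P(X) = 0.07, P(A) = 0.5, m(X) = 70, m(X,A) = 60
    print(binomial_p_value(1000, 0.07, 0.5, 70, 60))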
2.4. The z-score

The binomial probability is quite difficult to calculate, but for our purposes it is enough to have an upper bound for the p-value. This guarantees that no rules with a low p-value are lost when the search space is pruned. Additional pruning and ranking can be done afterwards, when the actual binomial probabilities are calculated. The simplest upper bound is based on the (binomial) z-score:

  z(X, A) = (m(X, A) − μ_M) / σ_M
          = (m(X, A) − nP(X)P(A)) / √(nP(X)P(A)(1 − P(X)P(A)))
          = √n (γ(X, A) − 1) √P(X, A) / √(γ(X, A) − P(X, A)).

The z-score measures how many standard deviations (σ_M) the observed frequency m(X, A) deviates from the expected value μ_M = nP(X)P(A). The corresponding probability can be easily approximated, because z follows the standard normal distribution when n is sufficiently large and P(X)P(A) (or 1 − P(X)P(A)) is neither close to 0 nor to 1. As a rule of thumb, the approximation can be used when nP(X)P(A) ≥ 5 (e.g. [12, p. 147]).
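The two forms of the z-score above are easy to check numerically; the following sketch (with hypothetical frequencies) computes both, and they coincide up to rounding.

    import math

    def z_score(n, p_x, p_a, p_xa):
        """z-score of rule X -> A from the observed frequencies P(X), P(A), P(X,A)."""
        m_xa, mu = n * p_xa, n * p_x * p_a
        sigma = math.sqrt(n * p_x * p_a * (1 - p_x * p_a))
        z1 = (m_xa - mu) / sigma                       # definition form
        gamma = p_xa / (p_x * p_a)                     # lift
        z2 = math.sqrt(n) * (gamma - 1) * math.sqrt(p_xa) / math.sqrt(gamma - p_xa)
        return z1, z2                                  # the two forms coincide

    print(z_score(n=1000, p_x=0.07, p_a=0.5, p_xa=0.06))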

According to [7], the approximation works well even for nP(X)P(A) ≥ 2 if a continuity correction (subtracting 0.5 from m(X, A)) is used. When P(X)P(A) is low, the binomial distribution is positively skewed, which means that the z-score overestimates the significance. Therefore we do not use the normal approximation to estimate the p-values; the z-score is used only as a measure function.

We note that the z-score is not crucial to our method: several other measure functions can be used as well. The requirement is that the measure I is a monotonically increasing or decreasing function of m(X, A) and γ(X, A). For example, when the expected value P(X)P(A) is very low, we can derive a tight upper bound for p from the Chernoff bound [10]:

  P(M > (1 + δ)μ_M) < e^{δμ_M} / (1 + δ)^{(1+δ)μ_M}.

By inserting δ = γ − 1, where γ = γ(X, A), and using γμ_M = m(X, A), we obtain

  p_ch = P(M > m(X, A)) < (e^{(γ−1)/γ} / γ)^{m(X,A)},

which is monotonically decreasing in both m(X, A) and γ.

2.5. Redundancy

A common goal in association rule discovery is to find the minimal (or most general) interesting rules and prune out redundant rules [5]. The reasons are twofold. First, the number of discovered rules is typically too large (even hundreds of thousands of rules) for any human interpreter. According to the Occam's Razor principle, it is only sensible to prune out a complex rule X ⇒ A if its generalizations Z ⇒ A, Z ⊊ X, are at least equally interesting. The user just has to define the interestingness measure carefully, according to the modelling purposes. Second, pruning redundant rules can save search time enormously if it is done on-line. This is not possible with many interestingness functions, and usually the pruning is done afterwards. In our case the interestingness measure is the statistical significance, but in general, redundancy and minimality can be defined with respect to any other measure function.

Definition 3 (Redundant rules). Given an increasing interestingness measure I, rule X ⇒ A is redundant, if there exists a rule X' ⇒ A' such that X' ∪ {A'} ⊊ X ∪ {A} and I(X' ⇒ A') ≥ I(X ⇒ A). If the rule is not redundant, then it is called non-redundant.

I.e. a rule is non-redundant if all its generalizations ("parent rules") are less significant. It is still possible that some or all of its specializations ("children rules") are better; in the latter case the rule is unlikely to be interesting itself. Non-redundant rules can be further classified as minimal or non-minimal:

Definition 4 (Minimal rules). Non-redundant rule X ⇒ A is minimal, if for all rules X' ⇒ A' such that X ∪ {A} ⊊ X' ∪ {A'}, I(X ⇒ A) ≥ I(X' ⇒ A').

I.e. a minimal rule is more significant than any of its parent or children rules. At the algorithmic level this means that we can stop the search without checking any children rules once we have ensured that a rule is minimal.

3. Main principles

In this section we introduce the main principles of the search algorithm. The results are given on such a general level that any suitable measure function or search strategy can be applied.

3.1. Problem definition

Let us first define the problem formally:

Definition 5 (Search problem). Let p(X ⇒ A) denote the p-value of rule X ⇒ A. Given binary data r and threshold p_max ∈ R, the problem is to search for a set of association rules S such that for all X ⇒ A ∈ S

  1. X ⇒ A expresses a positive correlation, i.e. γ(X, A) > 1,
  2. X ⇒ A is non-redundant,
  3. for all Y ⇒ B ∉ S, p(X ⇒ A) ≤ p(Y ⇒ B), and
  4. p(X ⇒ A) ≤ p_max.

We note that the user has to select only one parameter, p_max.
Alternatively, we could define an optimization problem where the N best rules (those with the lowest p-values) are searched for. Let us now assume that we have a measure function i(fr, γ) which defines an upper bound for the binomial probability. For any rule X ⇒ A, I(X ⇒ A) = i(P(XA), γ(X, A)). In addition, let i be either monotonically increasing or decreasing in both the frequency fr and the lift γ. The search problem can be divided into two subproblems:

  1. Search for all non-redundant rules X ⇒ A for which p(X ⇒ A) ≤ p_max, using i.
  2. Calculate the exact p-values and output the rules with sufficiently low p.

The postprocessing step is trivial, and we concentrate only on the search step. For simplicity, we assume that i is monotonically increasing; a monotonically decreasing measure function is handled similarly. The measure function I guarantees that all significant rules at level p_max are discovered. For efficiency, i should also prune out as many insignificant or redundant rules as possible.
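A minimal sketch of the postprocessing step (subproblem 2), assuming the candidate rules arrive from the search phase as tuples of row counts and marginal frequencies, could look as follows; both the candidate format and the figures are assumptions for illustration only.

    from scipy.stats import binom

    def exact_p(n, p_x, p_a, m_x, m_xa):
        """Exact binomial p-value of a candidate rule under the independence null hypothesis."""
        q = p_x * p_a
        return binom.cdf(m_x, n, q) - binom.cdf(m_xa - 1, n, q)

    def postprocess(candidates, p_max):
        """Keep candidates with p <= p_max and rank them by increasing p-value."""
        scored = [(exact_p(*c), c) for c in candidates]
        return sorted((p, c) for p, c in scored if p <= p_max)

    print(postprocess([(1000, 0.07, 0.5, 70, 60), (1000, 0.04, 0.06, 40, 40)], p_max=0.01))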

Figure 1. The general Apriori algorithm.

  Input: set of attributes R, data set r, an anti-monotonic property π
  Output: {X ∈ P(R) | π(X) = 1}
  Method:
    // Initialization
    S_1 = {A_i ∈ R | π(A_i) = 1}
    l = 1
    while (S_l ≠ ∅)
      // Step 1: Candidate generation
      generate C_{l+1} from S_l
      // Step 2: Pruning
      S_{l+1} = {c ∈ C_{l+1} | π(c) = 1}
      l = l + 1
    return ∪_l S_l

3.2. Monotonic and anti-monotonic properties

The key idea of the classical Apriori algorithm [2, 13] is the anti-monotonicity of frequency. For attribute sets, monotonicity and anti-monotonicity are defined as follows:

Definition 6 (Monotonic and anti-monotonic properties). Property π : P(R) → {0, 1} is monotonic, if (π(Y) = 1) ⇒ (π(X) = 1) for all X ⊇ Y, and anti-monotonic, if (π(X) = 1) ⇒ (π(Y) = 1) for all Y ⊆ X. When π is anti-monotonic, (π(Y) = 0) ⇒ (π(X) = 0) for all X ⊇ Y.

When the measure function defines an anti-monotonic property, the interesting sets or rules can be searched for with the general Apriori algorithm (Figure 1). The problem is that the measure functions for statistical significance do not define any anti-monotonic property. However, it turns out that the upper bound for the measure function I defines an anti-monotonic property for most set-inclusion relations.

3.3. Property PS

Let us define the property PS, "potentially significant". Potential significance of set X is a necessary condition for constructing any significant rule X \ A ⇒ A.

Definition 7. Let measure function I be as before, min_I a user-defined threshold, and upperbound(f) an upper bound for function f. Let bestrule(X) = arg max_{A ∈ X} {I(X \ A ⇒ A)} be the best rule which can be constructed from attributes X. Property PS : P(R) → {0, 1} is defined as PS(X) = 1, iff upperbound(I(bestrule(X))) ≥ min_I.

Now it is enough to define the conditions under which PS behaves anti-monotonically. The following theorem is the core of the whole search algorithm:

Theorem 1. Let PS, X, and Y be as before. If PS(X) = 1, then PS(Y) = 1 for all Y ⊆ X such that minattr(X) = minattr(Y).

Proof. First observe that for all A ∈ X we have γ(X \ A, A) ≤ 1/P(A) ≤ 1/P(minattr(X)) and upperbound(I(X \ A ⇒ A)) = i(P(X), 1/P(minattr(X))). Hence

  upperbound(I(bestrule(X))) = i(P(X), 1/P(minattr(X))) ≤ i(P(Y), 1/P(minattr(Y))) = upperbound(I(bestrule(Y)))

for all Y ⊆ X such that minattr(X) = minattr(Y) (since Y ⊆ X implies P(X) ≤ P(Y), and i is increasing in its first argument). We have min_I ≤ upperbound(I(bestrule(X))) by the definition of PS(X) = 1. Hence the reasoning above also yields min_I ≤ upperbound(I(bestrule(Y))), as required for the definition of PS(Y) = 1.

Corollary 1. If PS(Y) = 0, then PS(X) = 0 for all X ⊇ Y such that minattr(X) = minattr(Y).

We have shown that property PS defines an anti-monotonic property among sets having the same minimum attribute.
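As an illustration of Theorem 1, the following sketch evaluates the PS upper bound with the z-score as the measure function i; the frequencies P(X) and P(minattr(X)) are assumed to be given (e.g. counted during the search), and min_I plays the role of the threshold. This is an illustrative sketch, not the authors' implementation.

    import math

    def z_measure(n, fr, gamma):
        """Measure function i(fr, gamma): the z-score expressed via frequency and lift."""
        return math.sqrt(n) * (gamma - 1) * math.sqrt(fr) / math.sqrt(gamma - fr)

    def ps(n, p_X, p_minattr, min_I):
        """Potential significance of set X: for every rule derivable from X,
        gamma <= 1/P(minattr(X)) and P(XA) <= P(X), so i(P(X), 1/P(minattr(X)))
        bounds I(bestrule(X)) from above."""
        gamma_max = 1.0 / p_minattr
        return z_measure(n, p_X, gamma_max) >= min_I

    print(ps(n=1000, p_X=0.05, p_minattr=0.06, min_I=3.0))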
If P (minattr(y l )) > P (minattr(x)), Y l has a lower upperbound for γ than Y 1,..., Y l 1 and X have. Therefore, it is possible that Y l is non-p S, even if X is P S. 4. Algorithm Next, we give the StatApriori algorithm, which implements the pruning properties. 5

4.1. The main idea

The StatApriori algorithm proceeds like the general Apriori (Figure 1), using property PS. It alternates between the candidate generation and pruning steps as long as new non-redundant, potentially significant rules can be found. However, special techniques are needed, because property PS is not anti-monotonic in all respects.

First, the attributes are arranged into ascending order by their frequencies. Let the renamed attributes be {A_1, ..., A_k}, where P(A_1) ≤ ... ≤ P(A_k). The idea is that the candidates are generated in this canonical order. From l-set X = {A_1, ..., A_l} we can generate (l + 1)-sets X ∪ {A_j}, where j > l. Now all supersets of X have the same upper bound for the lift, γ ≤ 1/P(A_1). If X is non-PS, then none of its descendants can be PS. Otherwise, we should check the other parent sets Z ⊊ X ∪ {A_j}, |Z| = l. If at most one of them, (X \ {A_1}) ∪ {A_j}, is non-PS, then X ∪ {A_j} is added to the candidate collection C_{l+1}. If (X \ {A_1}) ∪ {A_j} was non-PS, it is also added to a temporary collection of special parents for frequency counting.

After candidate generation, the exact frequencies are counted from the data. Candidates which are non-PS or can produce only redundant descendants are pruned, and the others are added to collection S_{l+1}. The minimality of PS rules is also checked, because no new candidates are generated from minimal rules. The principles for redundancy and minimality checking are:

  1. If γ(bestrule(X)) = 1/P(minattr(X)), then the lift is already the maximal possible, and none of X's specializations can gain a better p-value. The rule is marked as minimal.
  2. If upperbound(I(bestrule(X ∪ {A_j}))) ≤ I(bestrule(Z)) for some attribute set Z ⊊ X ∪ {A_j}, then X ∪ {A_j} and all its specializations will be redundant with respect to Z, and X ∪ {A_j} is removed.

4.2. Enumeration tree

The secret of StatApriori is a special kind of enumeration tree, which enables an efficient implementation of the pruning principles. A complete enumeration tree lists all sets in the powerset P(R). In practice, it can be implemented as a trie, where each root-to-node path corresponds to an item set. StatApriori uses an ordered enumeration tree, where the attributes are arranged into ascending order by their frequencies.

Figure 2 shows an example, when R = {A, B, C, D, E} and P(A) ≤ P(B) ≤ ... ≤ P(E). Solid lines represent an example tree when two levels have been generated, and dashed lines show the nodes missing from the complete tree. Let us now consider the candidate generation at the third level. The missing nodes at the second level are either insignificant, or they and all their descendants are redundant. Set {A, B, C} is added to the tree, because all its parent sets {A, B}, {A, C}, and {B, C} are in the tree (i.e. PS) and non-minimal. Sets {A, B, D} and {A, B, E} are not added, because they have non-PS parents (the missing sets {A, D} and {A, E}) with the same minimum attribute A. The same holds for {A, C, D} and {A, C, E}. However, {B, C, D} is added to the tree, because the only missing parent, {C, D}, has a different minimum attribute. This is the special case, and {C, D} is also added to a temporary collection for frequency counting. Sets {B, C, E} and {B, D, E} are not added, because {B, E} is missing. After frequency counting, non-PS candidates are removed from the tree.

Figure 2. A complete enumeration tree (dashed lines) and an example tree (solid lines).
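The candidate generation just described can be sketched as follows (a simplified illustration, not the authors' implementation: attributes are integers renamed in ascending order of frequency, sets are kept as sorted tuples, and the minimality checks are omitted).

    def gen_candidates(S_l):
        """S_l: set of PS l-sets (sorted tuples). Returns (candidates, special_parents)."""
        candidates, special = set(), set()
        for X in S_l:
            for Y in S_l:
                if X[:-1] == Y[:-1] and X[-1] < Y[-1]:      # same prefix, hence same minattr
                    cand = X + (Y[-1],)
                    parents = [cand[:i] + cand[i + 1:] for i in range(len(cand))]
                    # parents[0] drops the minimum attribute: the special case
                    if all(p in S_l for p in parents[1:]):
                        candidates.add(cand)
                        if parents[0] not in S_l:
                            special.add(parents[0])          # kept only for frequency counting
        return candidates, special

    S_2 = {(0, 1), (0, 2), (1, 2), (1, 3)}                  # PS 2-sets over attributes 0..4
    print(gen_candidates(S_2))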
4.3. Pseudocode

The StatApriori algorithm is given in Figures 3, 4, and 5. In the pseudocode it is assumed that the measure function I for statistical significance is increasing.

4.4. Time complexity

It is known that the problem of searching for all frequent attribute sets is NP-hard in terms of the number of attributes k [11]. The worst case happens when the most significant association rule involves all k attributes and all 2^k attribute sets are generated. The worst-case complexity of the algorithm is O(max{k², nk}·2^k). Usually k < n, and the bound reduces to O(nk·2^k).

Theorem 2. The worst-case time complexity of StatApriori is O(max{k², nk}·2^k), where n is the number of rows and k is the number of attributes.

Figure 3. Algorithm StatApriori.

  Input: set of attributes R, data set r, increasing measure function I, threshold K
  Output: non-redundant rules which are significant
  Method:
    // Initialization
    S_1 = {A_i ∈ R | PS(A_i) = 1}
    l = 1
    while (S_l ≠ ∅)
      // Step 1: Candidate generation
      C_{l+1} = GenCands(S_l, l)
      // Step 2: Pruning
      S_{l+1} = PruneCands(C_{l+1}, l + 1, K)
      l = l + 1
    for all X_i ∈ ∪_l S_l such that (¬X_i.redundant) and (X_i.max_I ≥ K)
      output bestrule(X_i)

Figure 4. Algorithm GenCands.

  Input: potentially significant l-sets S_l, size l
  Output: (l + 1)-candidates C_{l+1}, special parents SpecPar
  Method:
    C_{l+1} = ∅; SpecPar = ∅
    // X_i and X_j have l − 1 common attributes, the same
    // minimum attribute, and neither of them is minimal
    for all X_i, X_j ∈ S_l such that ((|X_i ∩ X_j| == l − 1) and ¬Minimal(X_i) and ¬Minimal(X_j))
      // check the other parents with the same minattr
      if for all Z ⊂ X_i ∪ X_j such that ((|Z| = l) and (minattr(Z) == minattr(X_i))): (¬Minimal(Z) and Z ∈ S_l)
        add X_i ∪ X_j to C_{l+1}
        // check the special case
        if ((X_i ∪ X_j) \ minattr(X_i) ∉ S_l)
          add (X_i ∪ X_j) \ minattr(X_i) to SpecPar
    return (C_{l+1}, SpecPar)

Figure 5. Algorithm PruneCands.

  Input: l-candidates C_l, size l, threshold K
  Output: potentially significant l-sets S_l
  Method:
    S_l = ∅
    for all X_i ∈ C_l, Y_j ∈ SpecPar
      count frequencies P(X_i) and P(Y_j) from r
    for all X_i ∈ C_l
      max_γ = 1/P(minattr(X_i))
      // check whether X_i is PS and its descendants can
      // produce non-redundant rules
      if (PS(P(X_i), max_γ, K) and ¬Redundant(P(X_i), max_γ))
        add X_i to S_l
        if (I(bestrule(X_i)) == upperbound(I(bestrule(X_i))))
          X_i.minimal = 1
        // check whether X_i is redundant; its descendants can
        // still be non-redundant
        if (Redundant(P(X_i), γ(bestrule(X_i))))
          X_i.redundant = 1
        X_i.max_I = max{Y_j.max_I | Y_j ∈ Parents(X_i)}
    return S_l

Proof. The initialization (generation of the 1-sets) takes nk steps. Producing the l-sets and their best rules takes l²|C_l| + 2n|C_l| log k + 2|S_l| l time steps. The first term is the time complexity of the candidate generation: each candidate has l parents, and each parent can be found in the trie in l − 1 steps. The second term is the complexity of the frequency counting: the database is read (n rows), and on each row |C_l| candidates are checked; in the worst case, all candidates have an extra parent which has to be checked, too, and each check takes at most log k steps when the data is stored as bit vectors and the inclusion test is implemented with logical bit operations. The third term is the complexity of the rule selection phase: for each of the |S_l| sets, all l parents are checked. Checking is done at most twice, once for calculating the maximal I-value (selecting the best rule) and a second time for checking the redundancy, and each check can be implemented in constant time if the parent pointers are stored in a temporary structure in the candidate generation phase. Since |S_l| ≤ |C_l|, the total complexity is

  Σ_{l=2}^{k} max{l², n log k}·|C_l| < max{k², n log k}·Σ_{l=2}^{k} C(k, l) = O(max{k², nk}·2^k).

5. Experiments

The main goal of the experiments was to evaluate the speed-accuracy ratio of the StatApriori algorithm. Even a clever algorithm is worthless unless it can produce better results or perform faster than existing methods. It was expected that StatApriori could not compete in speed with the traditional methods, but that it would produce more accurate rules.

The data sets and test parameters are described in Table 2. The first four data sets are classical benchmark data sets from the FIMI repository for frequent item set mining (http://fimi.cs.helsinki.fi/). Plants lists all plant species growing in the U.S.A. and Canada; each row contains a species and its distribution (the states where it grows). The data has been extracted from the USDA plants database. Garden lists recommended plant combinations; the data is extracted from several gardening sources (e.g. baygardens.tripod.com/). Each data set was tested with two minimum confidence thresholds, the higher one being 0.90. The goal was to find both strong (and probably accurate) rules and strong correlations. For all tested measures we calculated the average prediction accuracy (error in the test set) and lift among the 100 best rules over 10 executions. All experiments were executed on an Intel Core Duo processor with 1 GB RAM, a 2 MB cache, and the Linux operating system.

The quality of the rules is summarized in Table 3. In StatApriori, the main measure function was the z-score, but the binomial probabilities (p-values) were used for redundancy reduction, too. A rule was considered redundant if it had either a lower z-score than its parent rules, or if the lower bound of its log(p) was higher than the minimum upper bound of its parents' log(p). This strategy proved to be efficient when the frequencies become low and the z-scores inflate. For comparison, rules were also selected with the χ²-measure, J-measure, z-score, and frequency, after normal frequency-based pruning.

StatApriori produced very accurate results on all data sets except Chess and Garden. The latter was difficult for all measures, because all patterns are very rare; for a proper analysis, an ontology of genus, species, subspecies, and variety should be used. The poor behaviour on Chess is harder to explain. For all other measures the rules were selected with an exceptionally high minimum frequency (min_fr = 0.75). This means that the consequent holds on at least 75% of the rows and the error is less than 25% even if the antecedent is empty. In fact, the rules did not represent any correlations: the consequents were totally independent of the antecedents.

In all cases, StatApriori produced the strongest lift. This is understandable, because statistical significance measures the correlations. When the z-score was used with the minimum frequency thresholds, the lift values were much smaller. The accuracy was also poorer, which suggests that the z-score suffers from frequency-based pruning; quite likely the same holds for χ². It is noteworthy that StatApriori performed faster than the traditional Apriori in all test cases, even though no minimum frequency thresholds were used. The maximum execution time, 110 s, was for Chess. The large minimum frequencies for Apriori are partly due to heavy postprocessing: for feasibility, the thresholds were set to avoid an excessive number of rules. However, the dense data sets are difficult for Apriori even without this restriction.
For example, Apriori cannot handle Chess if the minimum frequency is lowered much further.

6. Conclusions

Searching for statistically significant association rules is an important but neglected problem. So far, it has been considered computationally infeasible for any larger data sets. In this paper we have shown that it is possible to search for all statistically significant rules in a reasonable time. We have introduced a set of effective pruning properties and a breadth-first search strategy, StatApriori, which implements them.

StatApriori can be used in two ways: either to search for the K most significant association rules, or for all rules passing a given significance threshold (minimum z-score). This enables the user to solve the multiple testing problem (i.e. setting the significance threshold) in the desired way, or to use the algorithm only for ranking the most significant rules. At the same time, StatApriori solves another important problem and prunes out all redundant association rules. According to the experimental results, this improves rule quality by avoiding overfitting. Together, the z-score and redundancy reduction provide a robust method for rule discovery, i.e. the discovered rules have a high probability of holding in future data.

As far as we know, this is the first algorithm of its kind. The few existing algorithms have either searched only for classification rules with statistical measures, or used frequency-based pruning to some extent. Both of these strategies are likely to lose significant association rules. In future research we are going to improve the efficiency further, using a suitable indexing structure or additional pruning criteria. The final goal is to develop an efficient algorithm for searching for the most significant general association rules, containing propositional logic formulas.

7. Acknowledgments

We thank the Finnish Concordia Fund (konkordia-liitto.com/) for supporting this research.

9 search. References [1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In P. Buneman and S. Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages , Washington, D.C., [2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB 94, pages Morgan Kaufmann, [3] A. Agresti and Y. Min. Frequentist performance of bayesian confidence intervals for comparing proportions in 2 2 contingency tables. Biometrics, 61:515523, [4]. Baralis and P. Garza. A lazy approach to pruning classification rules. In Proceedings of the 2002 I International Conference on Data Mining (ICDM 02), page 35. I Computer Society, [5] Y. Bastide, N. Pasquier, R. Taouil, G. Stumme, and L. Lakhal. Mining minimal non-redundant association rules using frequent closed itemsets. In Proceedings of the First International Conference on Computational Logic (CL 00), volume 1861 of Lecture Notes in Computer Science, pages Springer-Verlag, [6] F. Berzal, I. Blanco, D. Sánchez, and M. A. V. Miranda. A new framework to assess association rules. In Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis (IDA 01), volume 2189 of Lecture Notes In Computer Science, pages , London, UK, Springer-Verlag. [7] K. Carriere. How good is a normal approximation for rates and proportions of low incidence events? Communications in Statistics: Simulation and Computation, 30: , [8] D. Freedman, R. Pisani, and R. Purves. Statistics. Norton & Company, London, 4th edition, [9] L. Geng and H. J. Hamilton. Interestingness measures for data mining: A survey. ACM Computing Surveys, 38(3):9, [10] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:1330, [11] C. Jermaine. Finding the most interesting correlations in a database: how hard can it be? Information Systems, 30(1):21 46, [12] B. Lindgren. Statistical Theory. Chapman & Hall, Boca Raton, U.S.A., 4th edition, [13] H. Mannila, H. Toivonen, and A. Verkamo. fficient algorithms for discovering association rules. In Papers from the AAAI Workshop on Knowledge Discovery in Databases (KDD 94), pages AAAI Press, [14] R. Meo. Theory of dependence values. ACM Transactions on Database Systems, 25(3): , [15] S. Morishita and A. Nakaya. Parallel branch-and-bound graph search for correlated association rules. In Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD, volume 1759 of Lecture Notes in Computer Science, pages Springer-Verlag, [16] S. Morishita and J. Sese. Transversing itemset lattices with statistical metric pruning. In Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS 00), pages ACM Press, [17] S. Nijssen and J. Kok. Multi-class correlated pattern mining. In Proceedings of the 4th International Workshop on Knowledge Discovery in Inductive Databases, volume 3933 of Lecture Notes in Computer Science. Springer-Verlag, [18] G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro and W. Frawley, editors, Knowledge Discovery in Databases, pages AAAI/MIT Press, [19] J. Shaffer. Multiple hypothesis testing. Annual Review of Psychology, 46: , [20] C. Silverstein, S. Brin, and R. Motwani. 
Beyond market baskets: Generalizing association rules to dependence rules. Data Mining and Knowledge Discovery, 2(1):39 68, [21] P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the right objective measure for association analysis. Information Systems, 29(4): , [22] H. Toivonen, P. Onkamo, K. Vasko, V. Ollikainen, P. Sevon, H. Mannila, M. Herr, and J. Kere. Data mining applied to linkage disequilibrium mapping. American Journal of Human Genetics, 67: , [23] G. Webb. Discovering significant rules. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD 06), pages , New York, USA, ACM Press. [24] G. I. Webb. Discovering significant patterns. Machine Learning, 68(1):1 33,

Table 2. Description of data sets and test parameters. (Columns: data set, n, k, and min_cf, with min_Z for StatApriori and min_fr for Apriori. Test cases: 1a/1b Mushroom, 2a/2b Chess, 3a/3b T10I4D100K, 4a/4b T40I10D100K, 5a/5b Plants, 6a/6b Garden.)

Table 3. Average rule accuracy and lift with different measure functions. (Columns: lift γ and test-set error err for StatApriori (z-score and p-value) and, after frequency-based pruning, for the χ²-measure, J-measure, z-score, and frequency, over the same test cases 1a-6b.)


More information

Association Analysis Part 2. FP Growth (Pei et al 2000)

Association Analysis Part 2. FP Growth (Pei et al 2000) Association Analysis art 2 Sanjay Ranka rofessor Computer and Information Science and Engineering University of Florida F Growth ei et al 2 Use a compressed representation of the database using an F-tree

More information

Maintaining Frequent Itemsets over High-Speed Data Streams

Maintaining Frequent Itemsets over High-Speed Data Streams Maintaining Frequent Itemsets over High-Speed Data Streams James Cheng, Yiping Ke, and Wilfred Ng Department of Computer Science Hong Kong University of Science and Technology Clear Water Bay, Kowloon,

More information

Algorithmic Methods of Data Mining, Fall 2005, Course overview 1. Course overview

Algorithmic Methods of Data Mining, Fall 2005, Course overview 1. Course overview Algorithmic Methods of Data Mining, Fall 2005, Course overview 1 Course overview lgorithmic Methods of Data Mining, Fall 2005, Course overview 1 T-61.5060 Algorithmic methods of data mining (3 cp) P T-61.5060

More information

Apriori algorithm. Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK. Presentation Lauri Lahti

Apriori algorithm. Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK. Presentation Lauri Lahti Apriori algorithm Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK Presentation 12.3.2008 Lauri Lahti Association rules Techniques for data mining and knowledge discovery in databases

More information

Frequent Itemset Mining

Frequent Itemset Mining ì 1 Frequent Itemset Mining Nadjib LAZAAR LIRMM- UM COCONUT Team (PART I) IMAGINA 17/18 Webpage: http://www.lirmm.fr/~lazaar/teaching.html Email: lazaar@lirmm.fr 2 Data Mining ì Data Mining (DM) or Knowledge

More information

Transaction Databases, Frequent Itemsets, and Their Condensed Representations

Transaction Databases, Frequent Itemsets, and Their Condensed Representations Transaction Databases, Frequent Itemsets, and Their Condensed Representations Taneli Mielikäinen HIIT Basic Research Unit Department of Computer Science University of Helsinki, Finland Abstract. Mining

More information

Quantitative Association Rule Mining on Weighted Transactional Data

Quantitative Association Rule Mining on Weighted Transactional Data Quantitative Association Rule Mining on Weighted Transactional Data D. Sujatha and Naveen C. H. Abstract In this paper we have proposed an approach for mining quantitative association rules. The aim of

More information

Mining Molecular Fragments: Finding Relevant Substructures of Molecules

Mining Molecular Fragments: Finding Relevant Substructures of Molecules Mining Molecular Fragments: Finding Relevant Substructures of Molecules Christian Borgelt, Michael R. Berthold Proc. IEEE International Conference on Data Mining, 2002. ICDM 2002. Lecturers: Carlo Cagli

More information

DATA MINING LECTURE 4. Frequent Itemsets, Association Rules Evaluation Alternative Algorithms

DATA MINING LECTURE 4. Frequent Itemsets, Association Rules Evaluation Alternative Algorithms DATA MINING LECTURE 4 Frequent Itemsets, Association Rules Evaluation Alternative Algorithms RECAP Mining Frequent Itemsets Itemset A collection of one or more items Example: {Milk, Bread, Diaper} k-itemset

More information

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Classification without Models Well, partially without a model } Today: Decision Trees 2015 Bruno Ribeiro 2 3 Why Trees? } interpretable/intuitive,

More information

Decision Trees. CSC411/2515: Machine Learning and Data Mining, Winter 2018 Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore

Decision Trees. CSC411/2515: Machine Learning and Data Mining, Winter 2018 Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore Decision Trees Claude Monet, The Mulberry Tree Slides from Pedro Domingos, CSC411/2515: Machine Learning and Data Mining, Winter 2018 Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore Michael Guerzhoy

More information

Connections between mining frequent itemsets and learning generative models

Connections between mining frequent itemsets and learning generative models Connections between mining frequent itemsets and learning generative models Srivatsan Laxman Microsoft Research Labs India slaxman@microsoft.com Prasad Naldurg Microsoft Research Labs India prasadn@microsoft.com

More information

Naive Bayesian classifiers for multinomial features: a theoretical analysis

Naive Bayesian classifiers for multinomial features: a theoretical analysis Naive Bayesian classifiers for multinomial features: a theoretical analysis Ewald van Dyk 1, Etienne Barnard 2 1,2 School of Electrical, Electronic and Computer Engineering, University of North-West, South

More information

Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar

Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar Data Mining Chapter 5 Association Analysis: Basic Concepts Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar 2/3/28 Introduction to Data Mining Association Rule Mining Given

More information

Layered critical values: a powerful direct-adjustment approach to discovering significant patterns

Layered critical values: a powerful direct-adjustment approach to discovering significant patterns Mach Learn (2008) 71: 307 323 DOI 10.1007/s10994-008-5046-x TECHNICAL NOTE Layered critical values: a powerful direct-adjustment approach to discovering significant patterns Geoffrey I. Webb Received:

More information

Regression and Correlation Analysis of Different Interestingness Measures for Mining Association Rules

Regression and Correlation Analysis of Different Interestingness Measures for Mining Association Rules International Journal of Innovative Research in Computer Scien & Technology (IJIRCST) Regression and Correlation Analysis of Different Interestingness Measures for Mining Association Rules Mir Md Jahangir

More information

Principles of AI Planning

Principles of AI Planning Principles of AI Planning 5. Planning as search: progression and regression Albert-Ludwigs-Universität Freiburg Bernhard Nebel and Robert Mattmüller October 30th, 2013 Introduction Classification Planning

More information

Principles of AI Planning

Principles of AI Planning Principles of 5. Planning as search: progression and regression Malte Helmert and Bernhard Nebel Albert-Ludwigs-Universität Freiburg May 4th, 2010 Planning as (classical) search Introduction Classification

More information

Investigating Measures of Association by Graphs and Tables of Critical Frequencies

Investigating Measures of Association by Graphs and Tables of Critical Frequencies Investigating Measures of Association by Graphs Investigating and Tables Measures of Critical of Association Frequencies by Graphs and Tables of Critical Frequencies Martin Ralbovský, Jan Rauch University

More information

CPDA Based Fuzzy Association Rules for Learning Achievement Mining

CPDA Based Fuzzy Association Rules for Learning Achievement Mining 2009 International Conference on Machine Learning and Computing IPCSIT vol.3 (2011) (2011) IACSIT Press, Singapore CPDA Based Fuzzy Association Rules for Learning Achievement Mining Jr-Shian Chen 1, Hung-Lieh

More information

Mining Literal Correlation Rules from Itemsets

Mining Literal Correlation Rules from Itemsets Mining Literal Correlation Rules from Itemsets Alain Casali, Christian Ernst To cite this version: Alain Casali, Christian Ernst. Mining Literal Correlation Rules from Itemsets. IMMM 2011 : The First International

More information

Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees

Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Tomasz Maszczyk and W lodzis law Duch Department of Informatics, Nicolaus Copernicus University Grudzi adzka 5, 87-100 Toruń, Poland

More information

FUZZY ASSOCIATION RULES: A TWO-SIDED APPROACH

FUZZY ASSOCIATION RULES: A TWO-SIDED APPROACH FUZZY ASSOCIATION RULES: A TWO-SIDED APPROACH M. De Cock C. Cornelis E. E. Kerre Dept. of Applied Mathematics and Computer Science Ghent University, Krijgslaan 281 (S9), B-9000 Gent, Belgium phone: +32

More information

Chapter 3 Deterministic planning

Chapter 3 Deterministic planning Chapter 3 Deterministic planning In this chapter we describe a number of algorithms for solving the historically most important and most basic type of planning problem. Two rather strong simplifying assumptions

More information

On the robustness of association rules

On the robustness of association rules On the robustness of association rules Philippe Lenca, Benoît Vaillant, Stéphane Lallich GET/ENST Bretagne CNRS UMR 2872 TAMCIC Technopôle de Brest Iroise CS 8388, 29238 Brest Cedex, France Email: philippe.lenca@enst-bretagne.fr

More information

Mining Interesting Infrequent and Frequent Itemsets Based on Minimum Correlation Strength

Mining Interesting Infrequent and Frequent Itemsets Based on Minimum Correlation Strength Mining Interesting Infrequent and Frequent Itemsets Based on Minimum Correlation Strength Xiangjun Dong School of Information, Shandong Polytechnic University Jinan 250353, China dongxiangjun@gmail.com

More information

CS-C Data Science Chapter 8: Discrete methods for analyzing large binary datasets

CS-C Data Science Chapter 8: Discrete methods for analyzing large binary datasets CS-C3160 - Data Science Chapter 8: Discrete methods for analyzing large binary datasets Jaakko Hollmén, Department of Computer Science 30.10.2017-18.12.2017 1 Rest of the course In the first part of the

More information

Data Mining and Knowledge Discovery. Petra Kralj Novak. 2011/11/29

Data Mining and Knowledge Discovery. Petra Kralj Novak. 2011/11/29 Data Mining and Knowledge Discovery Petra Kralj Novak Petra.Kralj.Novak@ijs.si 2011/11/29 1 Practice plan 2011/11/08: Predictive data mining 1 Decision trees Evaluating classifiers 1: separate test set,

More information

Machine Learning: Pattern Mining

Machine Learning: Pattern Mining Machine Learning: Pattern Mining Information Systems and Machine Learning Lab (ISMLL) University of Hildesheim Wintersemester 2007 / 2008 Pattern Mining Overview Itemsets Task Naive Algorithm Apriori Algorithm

More information

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING Santiago Ontañón so367@drexel.edu Summary so far: Rational Agents Problem Solving Systematic Search: Uninformed Informed Local Search Adversarial Search

More information

Frequent Pattern Mining: Exercises

Frequent Pattern Mining: Exercises Frequent Pattern Mining: Exercises Christian Borgelt School of Computer Science tto-von-guericke-university of Magdeburg Universitätsplatz 2, 39106 Magdeburg, Germany christian@borgelt.net http://www.borgelt.net/

More information

Constraint-based Subspace Clustering

Constraint-based Subspace Clustering Constraint-based Subspace Clustering Elisa Fromont 1, Adriana Prado 2 and Céline Robardet 1 1 Université de Lyon, France 2 Universiteit Antwerpen, Belgium Thursday, April 30 Traditional Clustering Partitions

More information

Pattern Structures 1

Pattern Structures 1 Pattern Structures 1 Pattern Structures Models describe whole or a large part of the data Pattern characterizes some local aspect of the data Pattern is a predicate that returns true for those objects

More information

Learning Decision Trees

Learning Decision Trees Learning Decision Trees CS194-10 Fall 2011 Lecture 8 CS194-10 Fall 2011 Lecture 8 1 Outline Decision tree models Tree construction Tree pruning Continuous input features CS194-10 Fall 2011 Lecture 8 2

More information

From inductive inference to machine learning

From inductive inference to machine learning From inductive inference to machine learning ADAPTED FROM AIMA SLIDES Russel&Norvig:Artificial Intelligence: a modern approach AIMA: Inductive inference AIMA: Inductive inference 1 Outline Bayesian inferences

More information

Approximating a Collection of Frequent Sets

Approximating a Collection of Frequent Sets Approximating a Collection of Frequent Sets Foto Afrati National Technical University of Athens Greece afrati@softlab.ece.ntua.gr Aristides Gionis HIIT Basic Research Unit Dept. of Computer Science University

More information

Levelwise Search and Borders of Theories in Knowledge Discovery

Levelwise Search and Borders of Theories in Knowledge Discovery Data Mining and Knowledge Discovery 1, 241 258 (1997) c 1997 Kluwer Academic Publishers. Manufactured in The Netherlands. Levelwise Search and Borders of Theories in Knowledge Discovery HEIKKI MANNILA

More information