Guaranteeing the Accuracy of Association Rules by Statistical Significance


W. Hämäläinen, Department of Computer Science, University of Helsinki, Finland

Abstract. Association rules are a popular knowledge discovery method, even if the results can be highly misleading. The traditional frequency-confidence framework can produce mostly spurious rules while skipping the most significant rules. Several goodness measures have been proposed to solve this problem, but none of them is optimal. In this paper, we analyze the accuracy of the most popular and promising measure functions from the statistical point of view. For each measure function, we analyze theoretically whether it can produce type 1 or type 2 errors. Finally, we report experiments which show that statistically significant rules can also produce more accurate predictions.

1 Introduction

Traditional association rules [1] are rules of the form "if event X occurs, then event A is also likely to occur". The commonness of the rule is measured by its frequency P(X, A) and the strength of the rule by its confidence P(A | X). For computational purposes it is required that both the frequency and the confidence exceed some user-defined thresholds. The actual interestingness of the rule is usually decided afterwards, by some interestingness measure.

Often the associations are interpreted as dependencies between certain attribute value combinations. However, traditional association rules do not necessarily capture statistical dependencies: they can associate absolutely independent events while ignoring strong dependencies. Even if the rules express dependencies (indicated by lift, P(A | X)/P(A)), the dependencies are not necessarily significant, but may be due to chance. In statistics, finding such spurious rules is called a type 1 error. Webb [9] has tested the amount of spurious rules among frequent and strong rules, and found that in the worst case all discovered rules can be spurious.
In practice, this means that the future data does not exhibit the discovered dependencies, and the conclusions based on them are erroneous. On the other hand, the traditional association rules can reject the most significant rules. In statistics, this is called a type 2 error. The impact of type 2 errors is harder to evaluate empirically, because the common search algorithms are based on frequency-based pruning. Several interestingness measures (see e.g. [4] for an overview) have been proposed to solve this problem. Some of them have their origins in statistics, but none of them tests the statistical significance of the dependency expressed in the rule.
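To make the pitfall concrete, here is a minimal sketch (Python; the data is synthetic and the 90% probabilities are illustrative assumptions, not from the paper) showing that two independently generated events can still yield a frequent, high-confidence rule, while lift stays near 1 and reveals the independence:

```python
import random

random.seed(0)

# Two binary attributes generated independently, each true with probability 0.9.
n = 10_000
rows = [(random.random() < 0.9, random.random() < 0.9) for _ in range(n)]

fr = sum(a and b for a, b in rows) / n        # frequency P(X, A)
p_x = sum(a for a, _ in rows) / n             # P(X)
p_a = sum(b for _, b in rows) / n             # P(A)
cf = fr / p_x                                 # confidence P(A | X)
lift = fr / (p_x * p_a)                       # lift near 1 means independence

# The rule X -> A passes typical thresholds (fr around 0.81, cf around 0.9),
# yet lift stays near 1: the "dependency" is spurious.
print(f"fr={fr:.3f}  cf={cf:.3f}  lift={lift:.3f}")
```

A frequency-confidence miner would happily report this rule; only a dependence-aware measure flags it as uninteresting.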

In this paper, we analyze how well the most common and statistically promising measures capture statistical significance. For each measure, we analyze theoretically whether it can produce type 1 or type 2 errors. In addition, we report experiments which test the effect of statistical significance on prediction accuracy.

The organization of the paper is the following: the basic definitions are given in Section 2, the theoretical analysis of common measures is given in Section 3, experiments are reported in Section 4, and the final conclusions are drawn in Section 5.

2 Basic Definitions

In the following we give the basic definitions of the association rule, statistical dependence, statistical significance, and redundancy. The notations are introduced in Table 1.

Table 1. Basic notations.

Notation                                     Meaning
A, B, C, ...                                 binary attributes
a, b, c, ... ∈ {0, 1}                        attribute values
R = {A_1, ..., A_k}                          set of all attributes
|R| = k                                      number of attributes in R
Dom(R) = {0, 1}^k                            attribute space
X, Y, Z ⊆ R                                  attribute sets
Dom(X) = {0, 1}^l ⊆ Dom(R)                   domain of X, |X| = l
(X = x) = {(A_1 = a_1), ..., (A_l = a_l)}    event, |X| = l
t = {A_1 = t(A_1), ..., A_k = t(A_k)}        row
r = {t_1, ..., t_n | t_i ∈ Dom(R)}           relation (data set)
|r| = n                                      size of relation r
σ_{X=x}(r) = {t ∈ r | t[X] = x}              set of rows where X = x
m(X = x) = |σ_{X=x}(r)|                      absolute frequency of X = x
P(X = x) = m(X = x)/n                        relative frequency of X = x

2.1 Association Rules

Traditionally, association rules are defined in the frequency-confidence framework as follows:

Definition 1 (Association rule). Let R be a set of binary attributes and r a relation according to R. Let X ⊆ R, A ∈ R \ X, x ∈ Dom(X), and a ∈ Dom(A).

The confidence of the rule (X = x) → (A = a) is

cf(X = x → A = a) = P(X = x, A = a) / P(X = x) = P(A = a | X = x),

and the frequency of the rule is

fr(X = x → A = a) = P(X = x, A = a).

(Often either the frequency P(X, A) or the absolute frequency m(X, A) is called support. In this paper we use the statistical term frequency, which is unambiguous.)

Given user-defined thresholds min_cf, min_fr ∈ [0, 1], the rule (X = x) → (A = a) is an association rule in r, if
(i) cf(X = x → A = a) ≥ min_cf, and
(ii) fr(X = x → A = a) ≥ min_fr.

The first condition requires that an association rule should be strong enough, and the second condition requires that it should be common enough. In this paper, we call rules association rules even if no thresholds min_fr and min_cf are specified. Usually, it is assumed that the rule contains only positive attribute values (A_i = 1). In that case, the rule can be expressed simply by listing the attributes, e.g. A_1, A_3, A_5 → A_2.

2.2 Statistical Dependence

Statistical dependence is classically defined through statistical independence (e.g. [5]):

Definition 2 (Independence and dependence). Let X ⊆ R be an attribute set and A ∈ R \ X a binary attribute. Events X = x and A = a, x ∈ Dom(X), a ∈ Dom(A), are mutually independent, if P(X = x, A = a) = P(X = x)P(A = a). If the events are not independent, they are dependent.

The strength of the statistical dependence between (X = x) and (A = a) can be measured by lift or interest:

γ(X = x, A = a) = P(X = x, A = a) / (P(X = x)P(A = a)).     (1)

2.3 Statistical Significance

The idea of statistical significance tests is to estimate the probability of the observed or a rarer phenomenon under some null hypothesis. When the objective is to test the significance of the dependency between X = x and A = a, the

null hypothesis is the independence assumption: P(X = x, A = a) = P(X = x)P(A = a). If the estimated probability p is very small, we can reject the independence assumption and assume that the observed dependency is not due to chance, but significant at level p. The smaller p is, the more significant the observation is.

The significance of the observed frequency m(X, A) can be estimated exactly by the binomial distribution. Each row in relation r, |r| = n, corresponds to an independent Bernoulli trial, whose outcome is either 1 (XA occurs) or 0 (XA does not occur). All rows are mutually independent. Assuming the independence of attributes X and A, combination XA occurs on a row with probability P(X)P(A). Now the number of rows containing X, A is a binomial random variable M with parameters P(X)P(A) and n. The mean of M is μ_M = nP(X)P(A) and its variance is σ_M² = nP(X)P(A)(1 − P(X)P(A)). Probability P(M ≥ m(X, A)) gives the significance p:

p = Σ_{i=m(X,A)}^{m(X)} C(n, i) (P(X)P(A))^i (1 − P(X)P(A))^{n−i}.     (2)

This can be approximated by the standard normal distribution:

p ≈ 1 − Φ(z(X, A)),

where Φ(z) = (1/√(2π)) ∫_{−∞}^{z} e^{−u²/2} du is the standard normal cumulative distribution function and z(X, A) is the standardized m(X, A):

z(X, A) = (m(X, A) − μ_M)/σ_M = (m(X, A) − nP(X)P(A)) / √(nP(X)P(A)(1 − P(X)P(A))).     (3)

The cumulative distribution function Φ(z) is quite difficult to calculate, but for association rule mining it is enough to know z(X, A). Since Φ(z) is monotonically increasing, probability p is monotonically decreasing in terms of z(X, A). Thus, we can use the z-score as a measure function for ranking association rules according to their significance. On the other hand, we know that according to Chebyshev's inequality (the proof is given e.g. in [6, pp. 780-781])

P(−K < (M − μ_M)/σ_M < K) ≥ 1 − 1/K²,

i.e. P(z ≥ K) < 1/(2K²). For example, requirement K ≥ √10 ≈ 3.2 corresponds to statistical significance level p = 0.05.
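Both the z-score of Eq. (3) and the exact binomial tail of Eq. (2) are straightforward to compute. The sketch below (Python; the counts n = 1000, m(X) = 300, m(X, A) = 150, P(X) = 0.3, P(A) = 0.4 are hypothetical figures chosen for illustration) checks the normal approximation against the exact sum:

```python
import math

def z_score(m_xa: int, n: int, p_x: float, p_a: float) -> float:
    """Standardized frequency z(X, A) of Eq. (3) under the independence null."""
    mu = n * p_x * p_a
    sigma = math.sqrt(n * p_x * p_a * (1 - p_x * p_a))
    return (m_xa - mu) / sigma

def exact_p(m_xa: int, m_x: int, n: int, p_x: float, p_a: float) -> float:
    """Exact binomial tail p = P(M >= m(X, A)) of Eq. (2)."""
    q = p_x * p_a
    return sum(math.comb(n, i) * q**i * (1 - q)**(n - i)
               for i in range(m_xa, m_x + 1))

# Hypothetical example: 150 co-occurrences where independence predicts 120.
z = z_score(150, 1000, 0.3, 0.4)       # about 2.92 standard deviations
p = exact_p(150, 300, 1000, 0.3, 0.4)  # tail probability well below 0.01
print(f"z = {z:.2f}, exact p = {p:.4f}")
```

Since p decreases monotonically in z, ranking rules by the z-score alone avoids evaluating Φ at all.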

3 Analysis of Common Measures

In the following, we analyze common interestingness measures which could test the statistical significance. The measures are the χ² measure, the Pearson correlation coefficient φ, and the J-measure.

3.1 The χ² measure

Definition. The χ²-independence test may be the best-known statistical test for detecting dependencies between attributes. The idea of the χ² test is to compare the observed frequencies O(m(X)) to the expected frequencies E(m(X)) by

χ²(X) = Σ_{x ∈ Dom(X)} (O(m(X = x)) − E(m(X = x)))² / E(m(X = x)).

When the test variable is approximately normally distributed, the test measure follows the χ²-distribution. Usually this assumption holds for large n. As a rule of thumb, it is suggested (e.g. [6, p. 630]) that all of the expected frequencies should be at least 5. When we test the dependency between two attribute sets, X and Y, the test metric is

χ²(X, Y) = Σ_{i=0}^{1} Σ_{j=0}^{1} (m(X = i, Y = j) − nP(X = i)P(Y = j))² / (nP(X = i)P(Y = j))
         = n(P(X = 1, Y = 1) − P(X = 1)P(Y = 1))² / (P(X = 1)P(X = 0)P(Y = 1)P(Y = 0)).

If χ²(X, Y) is less than the critical χ² value at level p with 1 degree of freedom, X and Y are statistically independent with probability 1 − p. Otherwise, the dependency is significant at level p.

Applying χ² in association rule discovery. The simplest way to use the χ²-measure in association rule discovery is to generate rules from frequent sets based on their χ²-values. For each frequent set Y, all rules of the form Y \ C → C with a sufficient χ²-value are selected (e.g. [3]). This approach does not find all rules which are significant in the χ² sense. First, the rules are preselected according to their frequency. If the minimum frequency is set too high, some significant rules are missed. Second, it is possible that a weak rule (P(C | Y) ≤ 0.5) is selected, because its companion rules ¬Y → C, Y → ¬C, and/or ¬Y → ¬C are significant.
The rule confidence can be used to check that P(C | Y) > P(¬C | Y), but this does not guarantee that Y → C is significant.
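For two binary attributes, the closed form above makes the χ² statistic a one-liner. The sketch below (Python; the figures n = 1000, P(X) = P(Y) = 0.01 and deviation d = 0.00085 are illustrative choices) shows how rare attributes let a tiny absolute deviation exceed the 1%-level critical value 6.635 (1 degree of freedom, a standard table value):

```python
def chi2_stat(n: int, p_xy: float, p_x: float, p_y: float) -> float:
    # Closed form for a 2x2 table:
    # chi^2(X, Y) = n * d^2 / (P(X) P(~X) P(Y) P(~Y)),  d = P(X,Y) - P(X)P(Y)
    d = p_xy - p_x * p_y
    return n * d**2 / (p_x * (1 - p_x) * p_y * (1 - p_y))

CRITICAL_001 = 6.635  # chi^2 critical value, level 0.01, 1 degree of freedom

# Rare attributes: a deviation below 0.001 already looks "significant".
stat = chi2_stat(1000, 0.01 * 0.01 + 0.00085, 0.01, 0.01)
print(f"chi2 = {stat:.2f}, significant: {stat > CRITICAL_001}")
```

The same function returns exactly 0 for an independent distribution (d = 0), matching the definition.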

The first problem would be solved if we could search for the rules directly with the χ²-measure. Unfortunately, this is not feasible, since the χ²-measure is not monotonic: for a rule X → C and its generalization Y → C, Y ⊊ X, it is unknown whether χ²(X → C) > χ²(Y → C) or χ²(X → C) ≤ χ²(Y → C). Morishita et al. (e.g. [7]) have utilized the convexity of the χ² function when the consequent C is fixed. The idea is to prune a branch containing rule Y → C and all its specialization rules X → C, Y ⊊ X, if max{χ²(X → C)} < min_χ² for the given cutoff value min_χ². Because χ² is convex, max{χ²(X → C)} can be bounded by

χ²(X → C) ≤ max{ nP(Y, C)P(¬C) / ((1 − P(Y, C))P(C)), nP(Y, ¬C)P(C) / ((1 − P(Y, ¬C))P(¬C)) }.

Now the frequency-based pruning is not necessary and it is possible to find all rules with a sufficient χ²-value, or the best rules in the χ² sense. This approach works correctly when the goal is to find full dependencies.

Analysis. The main problem of the χ²-independence test is that it is designed to measure dependencies between attributes. That is why it can fail to detect significant partial dependencies. On the other hand, the χ²-test can yield a high value, thus indicating a significant dependency, even if the tested events were nearly independent. Negative correlations can be pruned by an extra test, P(X, Y) > P(X)P(Y), but it does not guarantee that the high χ²-value is due to X → Y.

Let us analyze the χ²-value when P(X, Y) = P(X)P(Y) + d. Now χ² can be expressed in terms of d:

χ²(X, Y) = nd² / (P(X)P(¬X)P(Y)P(¬Y)).

χ² is high when n and d are large and P(X)P(¬X)P(Y)P(¬Y) is small. The minimum value (16nd²) is achieved when P(X) = P(Y) = 0.5, and the maximum when P(X) and P(Y) approach either 0 or 1. For example, if P(X) = P(Y) = 0.01, χ² ≈ 10000nd², and even a minimal d suffices. E.g. if n = 1000, d ≥ 0.8·10⁻³ suffices for level 0.01, with P(X, Y) = 0.0009.
The problem is that if P(X) and/or P(Y) are large, the relative difference d/(P(X)P(Y)) is small and the partial dependency between X and Y is not significant. Still, the χ²-value can be large, because d/(P(¬X)P(¬Y)) is large. Thus, the high χ²-value is due to the partial dependency ¬X → ¬Y, and X → Y is a false discovery (type 1 error).

Example 1. Let P(X) = P(Y) = 1 − ε for arbitrarily small ε > 0. Let d be maximal, i.e. d = P(X)(1 − P(Y)) = (1 − P(X))P(Y) = ε(1 − ε) < ε. (The relative difference is still very small, d/(P(X)P(Y)) = ε/(1 − ε).) Now χ²(X, Y) is very large, the same as the data size n:

χ²(X, Y) = nd² / (P(X)P(Y)(1 − P(X))(1 − P(Y))) = nε²(1 − ε)² / (ε²(1 − ε)²) = n.

Still, rule X → Y is insignificant, since

z(X → Y) = √n ε(1 − ε) / ((1 − ε)√(1 − (1 − ε)²)) = √(nε)/√(2 − ε) ≈ √(nε/2) → 0,

when ε → 0. The high χ²-value is due to the partial dependency ¬X → ¬Y, which has a high z-score:

z(¬X → ¬Y) = √(n(1 − ε)/(1 + ε)) → √n,

when ε → 0. Rules ¬X → Y and X → ¬Y are meaningless, with

|z| = √(nε(1 − ε)/(1 − ε + ε²)) < √(nε(1 − ε)/(1 − ε)) = √(nε) → 0.

The χ²-measure is less likely to cause type 2 errors. If an association rule is just sufficiently significant, it also passes the χ²-test. The reason is that the χ²-value of rule X → Y increases quadratically in terms of its z-score:

Theorem 1. If z(X → Y) = K, then χ²(X, Y) ≥ K².

Proof. Let x = P(X) and y = P(Y). If z(X → Y) = K, then nd² = K²xy(1 − xy) and

χ²(X, Y) = nd² / (xy(1 − x)(1 − y)) = K²(1 − xy) / ((1 − x)(1 − y)) ≥ K²,

since (1 − x)(1 − y) ≤ 1 − xy for all x, y ∈ [0, 1].

3.2 Correlation coefficient

The Pearson correlation coefficient is sometimes used to rank association rules. Traditionally, it is used to measure linear dependencies between numeric attributes. When the Pearson correlation coefficient φ is calculated for binary attributes, it reduces to the square root of χ²/n:

φ(X, Y) = (P(X, Y) − P(X)P(Y)) / √(P(X)P(¬X)P(Y)P(¬Y)) = √(χ²(X, Y)/n).

Like χ²(X, Y), φ(X, Y) = 0 when P(X, Y) = P(X)P(Y) and the variables are mutually independent. Otherwise, the sign of φ tells whether the correlation is positive (φ > 0) or negative (φ < 0). We will first show that a rule can be insignificant even if the correlation coefficient φ(X, Y) = 1. This means that the φ-measure can produce false discoveries (type 1 error).

Observation 1. When P(X) and P(Y) approach 1, it is possible that φ(X, Y) = 1, even if z(X, Y) < K for any K > 0.

Proof. Let P(X) = P(Y) = 1 − ε for arbitrarily small ε > 0. Let d be maximal, i.e. d = P(X)(1 − P(Y)) = (1 − P(X))P(Y) = ε(1 − ε). Now the correlation coefficient is 1:

φ(X, Y) = d / √(P(X)(1 − P(X))P(Y)(1 − P(Y))) = ε(1 − ε) / (ε(1 − ε)) = 1.

Still, for any K > 0, z(X, Y) < K, when ε is small enough:

z(X, Y) = √n ε(1 − ε) / ((1 − ε)√(1 − (1 − ε)²)) = √(nε/(2 − ε)) < K, when ε < 2K²/(n + K²).

On the other hand, it is possible that the φ-measure rejects significant rules (type 2 error), especially when n is large. The following observation shows that this can happen when P(X) and P(Y) are relatively small; the smaller they are, the smaller an n suffices. That is why the correlation coefficient should be totally avoided as an interestingness measure for association rules.

Observation 2. It is possible that φ(X, Y) → 0 when n → ∞, even if rule X → Y is significant.

Proof. Let

z(X, Y) = √n d / √(P(X)P(Y)(1 − P(X)P(Y))) = K.

Then d = K√(P(X)P(Y)(1 − P(X)P(Y))/n) and

φ(X, Y) = K√(P(X)P(Y)(1 − P(X)P(Y))) / √(nP(X)P(Y)(1 − P(X))(1 − P(Y))) = K√((1 − P(X)P(Y)) / (n(1 − P(X))(1 − P(Y)))).

When P(X) ≤ p and P(Y) ≤ p for some p < 1, φ(X, Y) ≤ K√((1 + p)/(n(1 − p))) → 0, when n → ∞.

3.3 J-measure

The J-measure [8] is often used to assess the interestingness of association rules, although it was originally designed for learning classification rules. The J-measure is an information-theoretic measure derived from the mutual information. For decision rules X → C, the J-measure is defined as

J(C | X) = P(X, C) log (P(C | X)/P(C)) + P(X, ¬C) log (P(¬C | X)/P(¬C)) ∈ [0, ∞[.

The larger J is, the more interesting the rule should be. On the other hand, J(C | X) = 0, when the variables X and C are mutually independent (assuming that P(X) > 0).
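The definition above can be transcribed directly into code. The sketch below (Python; the base-2 logarithm and the illustrative distribution P(X) = 0.25, P(C | X) = 0.75, P(C) = 0.5 are my own choices, not prescribed by the paper) computes J(C | X):

```python
import math

def j_measure(p_x: float, p_c_given_x: float, p_c: float) -> float:
    """J(C | X) = P(X,C) log2(P(C|X)/P(C)) + P(X,~C) log2(P(~C|X)/P(~C))."""
    total = 0.0
    for pc_x, pc in ((p_c_given_x, p_c), (1 - p_c_given_x, 1 - p_c)):
        if pc_x > 0:  # the limit 0 * log(0) is taken as 0
            total += p_x * pc_x * math.log2(pc_x / pc)
    return total

# A fairly strong rule still gets a small J-value (about 0.047 here),
# and J is exactly 0 under independence, matching the definition.
print(j_measure(0.25, 0.75, 0.5))
print(j_measure(0.30, 0.50, 0.5))
```

Note that n appears nowhere in the computation, which is precisely why J cannot, by itself, measure statistical significance.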

The J-measure contains two terms of the mutual information MI between variables X and C:

MI(X, C) = J(C | X) + J(C | ¬X).

Thus, it measures the information gain of two rules, X → C and X → ¬C. Rule X → C has a high J-value if its complement rule X → ¬C has high confidence (type 1 error). In the extreme case, when P(C | X) = 0, J(C | X) = P(X) log (1/P(¬C)).

Type 2 errors (rejecting true discoveries) can also occur with a suitable distribution. One reason is that the J-measure omits n, which is crucial for statistical significance. It can easily be shown that J → 0 when P(X, C) → 0 or P(C) → 1. In the latter case, rule X → C cannot be significant, but it is possible that a rule is significant even if its frequency is relatively small:

Example 2. Let P(C | X) = 0.75 and P(C) = 0.5. Now

J(C | X) = P(X)(0.75 log₂(3/2) + 0.25 log₂(1/2)) ≈ 0.19 P(X)

and

z(X → C) = 0.25√(nP(X)) / √(0.5(1 − 0.5P(X))) = 0.5√(nP(X)/(2 − P(X))).

For example, when P(X) = 0.25, z = √n/(2√7) ≈ 0.19√n, which is high when n is high. Still, J(C | X) ≈ 0.05, which indicates that the rule is uninteresting.

3.4 Summary

In Table 2, we give a summary of the analyzed measures. For each measure, we report whether it can produce type 1 or type 2 errors, and all rules which affect the measure in addition to the actually measured rule.

Table 2. Summary of measures M for assessing association rules. The occurrence of type 1 (accepting spurious rules) and type 2 (rejecting true discoveries) errors is indicated by + (occurs) and − (does not occur). In addition, all rules which contribute to M(X → Y) are listed. For all measures except fr&cf, the antecedent and consequent of each rule can be switched.

M      Type 1 error   Type 2 error   Rules
fr&cf  +              +              X → Y
fr&γ   −              −              X → Y
χ²     +              −              X → Y, ¬X → Y, X → ¬Y, ¬X → ¬Y
φ      +              +              X → Y, ¬X → ¬Y
J      +              +              X → Y, X → ¬Y

4 Experimental Results

The goal of the experiments was to test how well the z-score works in practice. According to the theoretical analysis, the z-score outperforms χ², the J-measure, and

the correlation coefficient when the goal is to find statistically significant association rules. This means that the discovered dependencies should hold also in unseen future data. If the rule is strong, the prediction error should also be low. An upper bound for the mean error on unseen data is

err ≤ (1 − p_α)P(¬A | X) + p_α P(¬A),

where p_α is the significance level (the probability that X and A are actually independent). Confidence P(A | X) has a strong effect on the mean error, but when P(A | X) = 1, the upper bound for the error reduces to p_α. To test the validity of this argument, only strong rules (min_cf = 0.90) were searched from the data sets. Each data set was divided into a training set and a test set. The rules were learnt from the training set and the prediction accuracy of the 500 best rules was tested on the test set. The data sets are listed in Table 3. All of them were obtained from the FIMI repository for frequent itemset mining (http://fimi.cs.helsinki.fi/).

In the first experiment, the z-score was compared to χ², the J-measure, frequency, and the certainty factor. The correlation coefficient φ was skipped, because it produces an ordering identical to χ². On the other hand, we included the certainty factor, which has produced good results in other experiments (e.g. [2]). First, the association rules were searched in a normal frequency-based manner and the measure functions were used for pruning in the post-processing phase. All redundant rules were pruned, i.e. rules X → A for which there is a more common rule X' → A', X' ∪ {A'} ⊊ X ∪ {A}, such that M(X' → A') ≥ M(X → A) for the given measure function M. The results are shown in Figure 1. The test points (numbers of best rules to be tested) are selected such that the rules with the same measure function value are tested together. In this way, the order of rules with the same rank does not affect the results.
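The redundancy-pruning criterion can be sketched as a naive O(n²) filter (Python; the dict-of-rules representation is my assumption, and a real miner would exploit the itemset lattice instead of comparing all pairs):

```python
def prune_redundant(rules: dict) -> dict:
    """Prune rule X -> A if some rule X' -> A' whose itemset X' + {A'} is a
    proper subset of X + {A} scores at least as high on the measure M.
    `rules` maps (frozenset(X), A) to the measure value M(X -> A)."""
    kept = {}
    for (x, a), m in rules.items():
        items = x | {a}
        redundant = any((x2 | {a2}) < items and m2 >= m
                        for (x2, a2), m2 in rules.items())
        if not redundant:
            kept[(x, a)] = m
    return kept

# Toy usage: {p} -> r scores higher than its specialization {p, q} -> r,
# so the longer rule is redundant and gets pruned.
rules = {(frozenset({"p", "q"}), "r"): 3.0, (frozenset({"p"}), "r"): 3.5}
print(prune_redundant(rules))   # keeps only the shorter rule
```

The `<` operator on frozensets tests proper-subset containment, which directly encodes the condition X' ∪ {A'} ⊊ X ∪ {A}.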
The z-score and χ² had quite similar performance: they performed best on mushroom, second best on T10I4D100K and T40I10D100K, and poorest on chess. The certainty factor performed best on all other sets except mushroom. Mushroom was the only data set where the rules had relatively low certainty factors; in all other sets most rules had certainty factor 1.0. Redundancy reduction had a large (positive) impact on the prediction accuracy, especially with certainty factors.

Table 3. Data sets and minimum frequencies used in the first experiment.

data set      n       k    min_fr
mushroom      4062    119  0.10
chess         1598    76   0.10
T10I4D100K    50 000  870  0.0005
T40I10D100K   50 000  941  0.015
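The evaluation protocol of the first experiment can be sketched as follows (Python; the set-of-items row representation and the choice to average per-rule error rates on matching test rows are my assumptions, since the paper does not spell these details out):

```python
from typing import FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], str]   # (antecedent X, consequent A)

def mean_error(rules: List[Rule], test_rows: List[FrozenSet[str]]) -> float:
    """Mean prediction error of the rules on held-out rows: for each rule
    X -> A, the error is the fraction of test rows matching X that lack A."""
    errors = []
    for antecedent, consequent in rules:
        matching = [row for row in test_rows if antecedent <= row]
        if matching:   # rules matching no test row contribute nothing
            wrong = sum(consequent not in row for row in matching)
            errors.append(wrong / len(matching))
    return sum(errors) / len(errors)

# Toy usage: one rule {a} -> b, three test rows; the rule matches two rows
# and is wrong on one of them.
rows = [frozenset({"a", "b"}), frozenset({"a"}), frozenset({"b"})]
print(mean_error([(frozenset({"a"}), "b")], rows))   # -> 0.5
```

With min_cf = 0.90 on the training half, a genuinely significant rule should keep this test-set error close to the p_α bound derived above.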

Fig. 1. The mean errors of frequent and strong association rules, using different measure functions (χ², J, z, certainty factor, frequency). The number of best rules is given on the x-axis and the mean error on the y-axis. The data sets are mushroom (top left), chess (top right), T10I4D100K (bottom left) and T40I10D100K (bottom right).

In the second experiment, association rules were searched directly with the z-score. The only requirement was that the rule holds on at least 5 rows. Unfortunately, the resulting min_fr values were so low that the rules could not be searched with the other measure functions, lacking efficient algorithms. For computational efficiency, chess and T40I10D100K were pruned by selecting every second attribute from all rows. The resulting data sets are denoted chess* and T40I10D100K*. The results are given in Table 4.

Now the results were quite different: more significant, strong rules were found than in the first experiment. The prediction error was also extremely low, zero in most cases. Nearly all of the rules were different from those in the first experiment, i.e. the z-score was able to produce extremely accurate rules when no min_fr was specified.

5 Conclusions

The theoretical results show that all the evaluated measures, χ², the J-measure, and the correlation coefficient, can produce either type 1 or type 2 errors. None of them

Table 4. The most significant rules by z-scores. min_fr and min_cf give the minimum frequency and confidence in the discovered rule set.

set           n      k    z-score  #rules  mean error  min_fr    min_cf
mushroom      4062   119  62.0     712     0.00        0.00047   1.00
chess*        1598   37   20.0     23      0.00        0.00016   1.0
T10I4D100K    46403  870  210.0    56      0.01        0.000097  1.0
T40I10D100K*  50000  470  210      66      0.04        0.000106  0.93

tests exactly the statistical significance of association rules. The only way to discover statistically significant rules is to use the z-score as a measure function. The empirical results suggest that statistical significance can also guarantee prediction accuracy, when the minimum frequency is not fixed. Since statistically significant association rules can be searched efficiently, they could also be used to search for significant attribute dependencies in the χ² sense.

References

1. R. Agrawal, T. Imielinski, and A.N. Swami. Mining association rules between sets of items in large databases. In P. Buneman and S. Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207-216, Washington, D.C., 1993.
2. F. Berzal, I. Blanco, D. Sánchez, and M. Amparo Vila Miranda. A new framework to assess association rules. In Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis (IDA'01), pages 95-104, London, UK, 2001. Springer-Verlag.
3. X. Dong, F. Sun, X. Han, and R. Hou. Study of positive and negative association rules based on multi-confidence and chi-squared test. In X. Li, O.R. Zaïane, and Z. Li, editors, Proceedings of the Second International Conference on Advanced Data Mining and Applications (ADMA), volume 4093 of Lecture Notes in Computer Science, pages 100-109, Berlin / Heidelberg, 2006. Springer-Verlag.
4. L. Geng and H.J. Hamilton. Interestingness measures for data mining: A survey. ACM Computing Surveys, 38(3):9, 2006.
5. R. Meo. Theory of dependence values.
ACM Transactions on Database Systems, 25(3):380-406, 2000.
6. J.S. Milton and J.C. Arnold. Introduction to Probability and Statistics: Principles and Applications for Engineering and the Computing Sciences. McGraw-Hill, New York, 4th edition, 2003.
7. S. Morishita and J. Sese. Traversing itemset lattices with statistical metric pruning. In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'00), pages 226-236, New York, USA, 2000. ACM Press.
8. P. Smyth and R.M. Goodman. An information theoretic approach to rule induction from databases. IEEE Transactions on Knowledge and Data Engineering, 4(4):301-316, 1992.
9. G.I. Webb. Discovering significant patterns. Machine Learning, 68(1):1-33, 2007.