Efficient discovery of statistically significant association rules


Wilhelmiina Hämäläinen
Department of Computer Science, University of Helsinki, Finland

ABSTRACT

In this paper, we introduce a new, effective algorithm for searching non-redundant, statistically significant association rules. StatApriori is based on the classical Apriori algorithm, but it generates the best non-redundant association rule as soon as each attribute set is generated. In addition, we estimate an upper bound for the significance of more special rules. If the upper bound is too low to produce any non-redundant, significant rules, the whole branch can be pruned. StatApriori guarantees that no significant rules are missed (type 2 error) and that the number of non-significant rules generated is minimal (type 1 error). In the classical Apriori, type 2 errors can be avoided by a sufficiently low minimum frequency threshold, at the cost of type 1 errors. In addition, a small frequency threshold causes an exponential explosion and the problem becomes intractable. StatApriori can tackle many of these difficult problems efficiently. According to empirical results, StatApriori needs only a small fraction of the time and memory of the classical Apriori.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications - Data mining

General Terms: association rules, statistical significance

1. INTRODUCTION

Traditional association rules [2] are rules of the form "if event $X = x$ occurs, then also event $A = a$ is likely to occur". The commonness of the rule is measured by frequency $P(X = x, A = a)$ and the strength of the rule by confidence $P(A = a \mid X = x)$. For computational purposes it is required that both frequency and confidence exceed some user-defined thresholds. The actual interestingness of the rule is usually decided afterwards, by some interestingness measure defined in terms of $P(X = x, A = a)$, $P(X = x)$, and $P(A = a)$. Often the associations are interpreted as dependencies between certain attribute value combinations.
However, traditional association rules do not necessarily capture statistical dependencies: they can associate absolutely independent events while ignoring strong dependencies. As a solution, it is often suggested (following Piatetsky-Shapiro [13]) to measure the degree of dependence instead of the confidence (e.g. [16, 17, 1]). This also produces statistically sounder results, since the statistical significance of a rule is a function of its frequency and degree of dependence. Together, these two quantities define the probability that the rule would have occurred by chance if the events were actually independent.

The main problem of traditional association rules is that they ignore statistical significance. Frequency-confidence-based discovery produces many spurious rules but misses statistically significant rules. In statistics, these two error types are called type 1 and type 2 errors. In the worst case, all discovered rules can be spurious [18, 19].

The problems of association rules, and especially of the frequency-confidence framework, are well known ([18, 19, 4, 1, 5, 12]), but there have been only a few attempts to solve them. The reason is that it is considered computationally infeasible to search all statistically significant rules. However, statistically sound algorithms (e.g. [11, 12]) have been introduced for searching decision rules, with a fixed consequent. In addition, there are a number of approaches (e.g. [8, 11, 12]) which use statistical measures, like $\chi^2$, for selecting the best association rules. Unfortunately, none of these approaches is suitable for association rules, and they can cause both type 1 and type 2 errors [6].

In this paper, we introduce a new algorithm called StatApriori which solves the problem. StatApriori guarantees that no significant rules are missed, while the number of spurious rule candidates generated during the execution is kept minimal. In addition, it produces only minimal significant rules, pruning out all redundant rules.

Compared to a modification of the classical Apriori algorithm that also produces all significant association rules, StatApriori is very efficient. It can tackle problems which are impossible to compute with the classical Apriori. The secret of StatApriori lies in two pruning techniques:

First, for each generated attribute set $X$, StatApriori estimates an upper bound for the statistical significance of rules derived from $X$'s supersets. If the upper bound is lower than the required statistical significance, the whole branch can be pruned.

Second, StatApriori selects the best association rule immediately¹, when a potentially significant attribute set is generated. Thus, the statistical significance of all more common rules is known, and a branch can be pruned if it would produce only redundant rules.

¹ In addition to statistical connotations, the word stat is an abbreviation for the Latin word statim, immediately.

The organization of the paper is the following: The basic definitions are given in Section 2 and the theoretical results in Section 3. The algorithm is presented in Section 4 and experimental results in Section 5. The final conclusions are drawn in Section 6.

2. BASIC DEFINITIONS

In the following we give basic definitions of the association rule, statistical dependence, statistical significance, and redundancy. The notations are introduced in Table 1.

Table 1: Basic notations.

Notation                                                      Meaning
$A, B, C, \ldots$                                             binary attributes
$a, b, c, \ldots \in \{0, 1\}$                                attribute values
$R = \{A_1, \ldots, A_k\}$                                    set of all attributes
$|R| = k$                                                     number of attributes in $R$
$Dom(R) = \{0, 1\}^k$                                         attribute space
$X, Y, Z \subseteq R$                                         attribute sets
$Dom(X) = \{0, 1\}^l \subseteq Dom(R)$                        domain of $X$, $|X| = l$
$(X = x) = \{(A_1 = a_1), \ldots, (A_l = a_l)\}$              event, $|X| = l$
$t = \{A_1 = t(A_1), \ldots, A_k = t(A_k)\}$                  row
$r = \{t_1, \ldots, t_n \mid t_i \in Dom(R)\}$                relation (data set)
$|r| = n$                                                     size of relation $r$
$\sigma_{X=x}(r) = \{t \in r \mid t[X] = x\}$                 set of rows where $X = x$
$m(X = x) = |\sigma_{X=x}(r)|$                                number of rows where $X = x$
$P(X = x) = \frac{m(X = x)}{n}$                               relative frequency of $X = x$

2.1 Association rules

Traditionally, association rules are defined in the frequency-confidence framework:

Definition 1 (Association rule). Let $R$ be a set of binary attributes and $r$ a relation according to $R$. Let $X \subseteq R$ and $Y \subseteq R \setminus X$ be attribute sets and $x \in Dom(X)$ and $y \in Dom(Y)$ their value combinations. The confidence of rule $(X = x) \to (Y = y)$ is

\[ cf(X = x \to Y = y) = \frac{P(X = x, Y = y)}{P(X = x)} = P(Y = y \mid X = x) \]

and the frequency of the rule is

\[ fr(X = x \to Y = y) = P(X = x, Y = y). \]

Given user-defined thresholds $min_{cf}, min_{fr} \in [0, 1]$, rule $(X = x) \to (Y = y)$ is an association rule in $r$, if
(i) $cf(X = x \to Y = y) \geq min_{cf}$, and
(ii) $fr(X = x \to Y = y) \geq min_{fr}$.

The first condition requires that an association rule should be strong enough and the second condition requires that it should be common enough. In this paper, we call rules association rules even if no thresholds $min_{fr}$ and $min_{cf}$ are specified.

Usually, it is assumed that the consequent $Y = y$ contains just one attribute (i.e. $|Y| = 1$) and the rule contains only positive attribute values ($A_i = 1$). Now the rule can be expressed simply by listing the attributes, e.g. $A_1, A_3, A_5 \to A_2$.

From the statistical point of view, the direction of a rule ($X \to Y$ or $Y \to X$) is a matter of choice. In this paper, we follow a common convention and assign the single attribute to the consequent. If the rule contains just two attributes, the direction is selected randomly.

2.2 Statistical dependence

Statistical dependence is usually defined through statistical independence (e.g. [15, 9]):

Definition 2 (Independence and dependence). Let $X \subseteq R$ and $Y \subseteq R \setminus X$ be sets of binary attributes. Events $X = x$ and $Y = y$, $x \in Dom(X)$, $y \in Dom(Y)$, are mutually independent, if $P(X = x, Y = y) = P(X = x)P(Y = y)$. If the events are not independent, they are dependent.

The strength of a statistical dependency between $(X = x)$ and $(Y = y)$ can be measured by the degree of dependence (also called dependence [20], degree of independence [21], or interest [5]):

\[ \gamma(X = x, Y = y) = \frac{P(X = x, Y = y)}{P(X = x)P(Y = y)}. \tag{1} \]

2.3 Statistical significance

The idea of statistical significance tests is to estimate the probability of the observed or a rarer phenomenon, under some null hypothesis.
When the objective is to test the significance of the dependency between $X = x$ and $Y = y$, the null hypothesis is the independence assumption: $P(X = x, Y = y) = P(X = x)P(Y = y)$. If the estimated probability $p$ is very small, we can reject the independence assumption and assume that the observed dependency is not due to chance, but significant at level $p$. The smaller $p$ is, the more significant the observation is.

The significance of the observed frequency $m(X, Y)$ can be estimated exactly by the binomial distribution. Each row in relation $r$, $|r| = n$, corresponds to an independent Bernoulli trial, whose outcome is either 1 ($XY$ occurs) or 0 ($XY$ does not occur). All rows are mutually independent. Assuming the independence of attributes $X$ and $Y$, combination $XY$ occurs on a row with probability $P(X)P(Y)$. Now the number of rows containing $X, Y$ is a binomial random variable $M$ with parameters $P(X)P(Y)$ and $n$. The mean of $M$ is $\mu_M = nP(X)P(Y)$ and its variance is $\sigma_M^2 = nP(X)P(Y)(1 - P(X)P(Y))$. Probability $P(M \geq m(X, Y))$ gives the significance $p$:

\[ p = \sum_{i=m(X,Y)}^{n} \binom{n}{i} (P(X)P(Y))^i (1 - P(X)P(Y))^{n-i}. \tag{2} \]

This can be approximated by the standard normal distribution, $p \approx 1 - \Phi(t(X, Y))$, where

\[ \Phi(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-u^2/2} \, du \]

is the standard normal cumulative distribution function and $t(X, Y)$ is the standardized $m(X, Y)$:

\[ t(X, Y) = \frac{m(X, Y) - \mu_M}{\sigma_M} = \frac{m(X, Y) - nP(X)P(Y)}{\sqrt{nP(X)P(Y)(1 - P(X)P(Y))}}. \tag{3} \]

The cumulative distribution function $\Phi(t)$ is quite difficult to calculate, but for association rule mining it is enough to know $t(X, Y)$. Since $\Phi(t)$ is monotonically increasing, probability $p$ is monotonically decreasing in terms of $t(X, Y)$. Thus, we can use $t$ as a measure function for ranking association rules according to their significance.

On the other hand, we know that according to Chebyshev's inequality (the proof is given e.g. in [10]):

\[ P\left(-K < \frac{M - \mu_M}{\sigma_M} < K\right) \geq 1 - \frac{1}{K^2}. \]

I.e. $P(t \geq K) < \frac{1}{2K^2}$. For example, requirement $K \geq 2$ corresponds to statistical significance level $p = 0.125$.

2.4 Bonferroni adjustment

Usually the minimum requirement for any significance is $p = 0.05$. It means that there is a 5% chance that a spurious rule passes the significance test ("type 1 error"). If we test 10 000 rules, it is likely that we will find 500 spurious rules. This so-called multiple testing problem is inherent in knowledge discovery, where we often perform an exhaustive search over all possible patterns. As a solution, the more patterns we test, the stricter bounds for the significance we should use.
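The quantities above are easy to try out numerically. The following is a minimal Python sketch, not part of the original paper: the function names `binomial_tail`, `chebyshev_bound`, and `expected_spurious` are illustrative assumptions, and the exact tail sum is only practical for small n because the binomial coefficients grow quickly.

```python
import math


def binomial_tail(m, n, p0):
    """Exact significance P(M >= m) for M ~ Bin(n, p0), as in Equation 2.

    p0 plays the role of P(X)P(Y) under the independence assumption.
    Intended for small n only (large n overflows the float conversion).
    """
    return sum(
        math.comb(n, i) * p0**i * (1 - p0) ** (n - i)
        for i in range(m, n + 1)
    )


def chebyshev_bound(K):
    """One-sided bound P(t >= K) < 1 / (2 K^2) used in the text."""
    return 1 / (2 * K**2)


def expected_spurious(num_tests, level):
    """Expected number of spurious rules when every test uses the same level."""
    return num_tests * level


if __name__ == "__main__":
    # 100 rows, P(X)P(Y) = 0.06, and XY observed on 12 rows.
    print(binomial_tail(12, 100, 0.06))
    print(chebyshev_bound(2))               # bound for K = 2
    print(expected_spurious(10_000, 0.05))  # the multiple testing example
```

The Chebyshev bound is distribution-free and therefore loose; the exact tail is typically much smaller than the bound for the same standardized score.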
The most well-known method is the Bonferroni adjustment [14], where the desired significance level $p$ is divided by the number of tests $m$. The same effect is achieved by using $\sqrt{m}K$ instead of $K$ as a minimum threshold for $t$. In association rule discovery, we can give an upper bound for the number of rules to be tested. However, this rule is so strict that there is a risk that we do not recognize all significant patterns ("type 2 error"). As a solution, we suggest that testing (the search) is finished before the rules become too complex and $m$ increases so high that no statistical significance can be guaranteed.

2.5 Redundancy

A common goal in association rule discovery is to find the most general rules (containing the minimal number of attributes) which satisfy the given search criteria. There is no sense in outputting complex rules $X \to Y$, if their generalizations $Z \to Y$, $Z \subsetneq X$, are at least equally significant. Generally, the goal is to find minimal (or most general) interesting rules, and prune out redundant rules [3]. In this paper we define redundancy as follows:

Definition 3 (Minimal and redundant rules). Given some interestingness measure $M$, rule $X \to Y$ is a minimal rule, if there does not exist any rule $X' \to Y'$ such that $X' \cup Y' \subsetneq X \cup Y$ and $M(X' \to Y') \geq M(X \to Y)$. If the rule is not minimal, then it is redundant.

In our case, function $t$ is used as the measure function, but in principle $M$ can be any function which increases with the interestingness. When redundancy reduction is applied in practice, we have to check two cases:

1. A rule is redundant, if there are at least equally good, more common rules.
2. A rule is not the most significant, minimal rule, if all its specializations are better.

3. THEORETICAL RESULTS

The statistical significance of rule $X \to Y$ is a function of its frequency $P(X, Y)$ and degree of dependence $\gamma(X, Y)$. Equation 3 can be presented equivalently as

\[ t(X \to Y) = \frac{\sqrt{nP(X, Y)}\,(\gamma - 1)}{\sqrt{\gamma - P(X, Y)}}. \tag{4} \]

The higher the frequency is, the lower the degree of dependence can be, and vice versa.
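The equivalence of Equations 3 and 4 follows from substituting $m(X, Y) = nP(X, Y)$ and $P(X)P(Y) = P(X, Y)/\gamma$. As a quick numeric check, here is a small Python sketch; the function names and the example counts are assumptions made for illustration only.

```python
import math


def t_from_counts(m_xy, n, p_x, p_y):
    """Equation 3: standardized count of rows where both X and Y occur."""
    p0 = p_x * p_y  # probability of XY on one row under independence
    return (m_xy - n * p0) / math.sqrt(n * p0 * (1 - p0))


def t_from_gamma(n, p_xy, gamma):
    """Equation 4: the same score from frequency and degree of dependence."""
    return math.sqrt(n * p_xy) * (gamma - 1) / math.sqrt(gamma - p_xy)


if __name__ == "__main__":
    # Hypothetical counts: 1000 rows, X on 200, Y on 300, both on 120.
    n, m_x, m_y, m_xy = 1000, 200, 300, 120
    p_x, p_y, p_xy = m_x / n, m_y / n, m_xy / n
    gamma = p_xy / (p_x * p_y)  # degree of dependence, Equation 1
    print(t_from_counts(m_xy, n, p_x, p_y))
    print(t_from_gamma(n, p_xy, gamma))  # agrees up to rounding
```

Equation 4 is the more convenient form for pruning, because it needs only the frequency of the whole attribute set and a bound on $\gamma$.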
The following theorem expresses the relationship between the frequency and the degree of dependence:

Theorem 1. When $t(X \to Y) = K$,

\[ P(X, Y) = \frac{K^2 \gamma}{n(\gamma - 1)^2 + K^2}. \]
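As a numeric illustration of Theorem 1, the sketch below (Python, with illustrative function names that are not from the paper) computes the frequency at which a rule with a given degree of dependence reaches exactly $t = K$, and plugs it back into Equation 4.

```python
import math


def t_stat(n, p_xy, gamma):
    """Equation 4."""
    return math.sqrt(n * p_xy) * (gamma - 1) / math.sqrt(gamma - p_xy)


def frequency_at_threshold(n, gamma, K):
    """Theorem 1: the frequency P(X, Y) for which t(X -> Y) equals K."""
    return K**2 * gamma / (n * (gamma - 1) ** 2 + K**2)


if __name__ == "__main__":
    n, K = 1000, 2.0
    for gamma in (1.5, 2.0, 5.0):
        p = frequency_at_threshold(n, gamma, K)
        # The stronger the dependence, the lower the required frequency.
        print(gamma, p, t_stat(n, p, gamma))
```

For fixed $n$ and $K$, the required frequency shrinks as $\gamma$ grows, which is the trade-off that the pruning results below exploit.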

Proof. By solving

\[ t(X \to Y) = \frac{\sqrt{nP(X, Y)}\,(\gamma - 1)}{\sqrt{\gamma - P(X, Y)}} = K. \]

This result can be used for pruning areas in the search space, when an upper bound for $\gamma$ is known.

The simplest method to search all statistically significant rules is to search all frequent sets with a sufficiently small $min_{fr}$ and then select from each frequent set the rules with sufficient $t$. The following theorem gives a safe minimum frequency threshold for the whole data set. It guarantees that no significant rules are missed.

Theorem 2. Let $p_{\min} = \min\{P(A_i) \mid A_i \in R\}$. Let $K \geq 2$ be the desired significance level. For all sets $X \subseteq R$ and any $A \in X$
(i) $\gamma(X \setminus A \to A) \leq \frac{1}{p_{\min}}$, and
(ii) $X \setminus A \to A$ cannot be significant, unless

\[ P(X) \geq \frac{K^2 p_{\min}}{n(1 - p_{\min})^2 + K^2 p_{\min}^2}. \]

Proof. By solving

\[ t(X \setminus A \to A) = \frac{\sqrt{nP(X)}\,(\gamma - 1)}{\sqrt{\gamma - P(X)}} \geq K \]

for $\gamma = \frac{1}{p_{\min}}$.

This result enables discovery of statistically significant association rules with the normal Apriori algorithm, followed by an appropriate rule generation phase. Unfortunately, the resulting $min_{fr}$ is so low that nearly all attribute sets are selected and the computation becomes infeasible. However, this approach can be used as a benchmark for evaluating the efficiency and correctness of other algorithms.

The theorem also gives the highest $K$-value which can be used while still finding rules. This is important information when the applicability of the Bonferroni adjustment is decided. In the extreme case, it can mean that there is not enough data to draw any statistical conclusions.

Corollary 1. If $K > \sqrt{\frac{n(1 - p_{\min})}{1 + p_{\min}}}$, no attribute set can produce significant rules.

Better pruning can be achieved by using several minimum frequencies. The following theorem defines a minimum frequency for the given set $X$:

Theorem 3. Let $p_{\min}(X) = \min\{P(A_i) \mid A_i \in X\}$ and

\[ P(X) < \frac{K^2 p_{\min}(X)}{n(1 - p_{\min}(X))^2 + K^2 p_{\min}(X)^2}. \]

Then for all $Y \supseteq X$ such that $p_{\min}(Y) = p_{\min}(X)$ all rules $Y \setminus A \to A$ are insignificant at level $K$.

Proof. Let $Y \supseteq X$ such that $p_{\min}(Y) = p_{\min}(X)$ and $A \in Y$. Now $\gamma(Y \setminus A \to A) \leq \frac{1}{p_{\min}(X)}$, $P(Y) \leq P(X)$ and

\[ t(Y \setminus A \to A) \leq \frac{\sqrt{nP(X)}\,(1 - p_{\min}(X))}{\sqrt{p_{\min}(X)(1 - P(X)p_{\min}(X))}}, \]

which is $< K$ according to the assumption.

The following corollary expresses the anti-monotonicity property used for pruning:

Corollary 2. No significant rules can be derived from $Y$, unless all subsets $X \subseteq Y$ with $p_{\min}(X) = p_{\min}(Y)$ have frequency

\[ P(X) \geq \frac{K^2 p_{\min}}{n(1 - p_{\min})^2 + K^2 p_{\min}^2}. \]

If set $X$ is sufficiently frequent, it is called potentially significant. Efficient pruning can be achieved by the following strategy: First, all attributes are arranged into an ascending order by their frequencies. Then the attribute sets are generated level by level, in a canonical order. From set $X = \{A_{i_1}, \ldots, A_{i_l}\}$, where $P(A_{i_1}) \leq \ldots \leq P(A_{i_l})$, we can generate sets $X \cup \{A_j\}$, where $j > i_l$. Now all supersets of $X$ have the same upper bound for the degree of dependence, $\gamma \leq \frac{1}{P(A_{i_1})}$. If $X$ is not potentially significant, then the whole branch can be pruned.

The upper bound for $t$ can be used for redundancy reduction, too. If $X$ or its subsets have already produced a rule with a $t$-value higher than or equal to the upper bound, then the branch can be pruned.

4. STATAPRIORI ALGORITHM

Next, we give the StatApriori algorithm which implements the theoretical results. In addition, we give an analysis of the worst-case time complexity.

4.1 Algorithm

The StatApriori algorithm is given in Figure 1. The main algorithm is otherwise similar to the traditional Apriori, except that it generates the best rule for each attribute set immediately. This enables efficient pruning of non-productive search paths.

First, all attribute sets are initialized by single attributes. If the supersets of an attribute cannot produce any significant rules, the attribute is pruned. After that the algorithm proceeds in a breadth-first manner, by generating new candidate sets (subroutine GenCands) and pruning attribute sets whose supersets cannot produce any minimal, significant rules (subroutine PruneCands).
Subroutine SelectRules generates the best rules from discovered attribute sets and checks their redundancy relative to more general rules. For simplicity, the collections of attribute sets are represented as arrays and each attribute set is defined as a record with four fields:

set     attributes
fr      frequency
cons    consequent of the best rule
maxt    t-value of the rule, or of its parent rule if the rule is redundant

We note that in the implementation, there is no need for separate collections; all collections can be stored in one indexing structure, like a trie.

Figure 1: Algorithm StatApriori(R, r, K) for searching all minimal, significant association rules from data.

Input: set of attributes R, data set r, significance level K
Output: minimal, significant rules
Method:
    PS_1 = ∅;
    for i = 0 to |R| − 1                      // initialize 1-sets
        PS_1[i].set = {R[i]};
    // database pass:
    count frequencies PS_1[i].fr from r;
    for i = 0 to |R| − 1
        if (otmaxt(|r|, PS_1[i].fr, PS_1[i].fr) < K)
            prune PS_1[i];
    arrange PS_1 into ascending order by PS_1[i].fr;
    l = 2;
    while ((l ≤ |R|) and (|PS_{l−1}| ≥ l))
        Par = ∅;
        (C_l, Par) = GenCands(PS_{l−1});            // Fig. 2
        (PS_l, Par) = PruneCands(C_l, Par, K);      // Fig. 3
        PS_l = SelectRules(PS_l, PS_{l−1} ∪ Par);   // Fig. 4
        l++;
    // Output minimal, significant rules:
    for i = 2 to l − 1
        for j = 0 to |PS_i| − 1
            if ((PS_i[j].red == 0) and (PS_i[j].maxt ≥ K) and (not childrenbetter(i, j)))
                output best rule(s) of PS_i[j];

Sets of $l$ attributes are called $l$-sets. The collection of all $l$-candidate sets is denoted by $C_l$ and the collection of all potentially significant $l$-sets is denoted by $PS_l$.

GenCands (Figure 2) generates $l$-candidate sets from potentially significant $(l-1)$-sets. Candidate set $X$ is generated only if it or any of its specializations $Z \supseteq X$ can produce a significant rule. The necessary condition is that all of $X$'s parent sets $Y \subsetneq X$, $|Y| = l - 1$, containing the same minimum attribute (the same upper bound for $\gamma$) are sufficiently frequent and thus in $PS_{l-1}$. The only parent with a different minimum attribute does not have to be in $PS_{l-1}$. However, if it is missing, it is added to a temporary collection $Par$ for the frequency counting. Its frequency is needed later when selecting the best rule of $X$.
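The parent condition described above can be sketched in Python as follows. This is a simplified illustration rather than the paper's procedure: it assumes attributes are numbered in ascending order of frequency, represents each attribute set as a sorted tuple of indices, and generates candidates by extending a set with a later attribute (the canonical order of Section 3) instead of joining pairs of sets. The names `gen_cands`, `ps_prev`, and `num_attrs` are hypothetical.

```python
from itertools import combinations


def gen_cands(ps_prev, num_attrs):
    """Generate l-candidates from potentially significant (l-1)-sets.

    ps_prev: set of sorted tuples of attribute indices, all of the same
             length; index 0 is assumed to be the least frequent attribute.
    Returns (candidates, special_parents).
    """
    candidates, special_parents = [], set()
    for x in sorted(ps_prev):
        for j in range(x[-1] + 1, num_attrs):
            cand = x + (j,)
            min_attr = cand[0]  # minimum attribute fixes the bound on gamma
            # Every parent that contains the minimum attribute must be
            # potentially significant.
            parents = [p for p in combinations(cand, len(x)) if min_attr in p]
            if not all(p in ps_prev for p in parents):
                continue
            candidates.append(cand)
            # The single parent without the minimum attribute may be
            # missing; it is kept aside for frequency counting.
            other = cand[1:]
            if other not in ps_prev:
                special_parents.add(other)
    return candidates, special_parents


if __name__ == "__main__":
    # 1-sets: attribute 2 was pruned, attributes 0, 1 and 3 survived.
    print(gen_cands({(0,), (1,), (3,)}, num_attrs=4))
    # 2-sets: (1, 2) is missing and becomes a special parent of (0, 1, 2).
    print(gen_cands({(0, 1), (0, 2)}, num_attrs=4))
```

Because every candidate is produced from its unique prefix, no candidate is generated twice in this variant.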
PruneCands (Figure 3) calculates the frequencies of $l$-candidate sets and special parent sets from the database. Then it checks if a candidate set or any of its specializations can produce a significant rule. If the frequency is too low, the whole branch is pruned.

SelectRules (Figure 4) checks the redundancy of potentially significant sets $X$. If any of the parent rules (more common rules) has a $t$-value higher than or equal to the upper bound for $t$ in $X$ and its children, then $X$ is pruned. If $X$ was saved, but some of its parents have a higher $t$-value, then $X$ is marked as redundant and $X$'s $t$-value is replaced by the parents' maximal $t$-value. In this way, it is always enough to check the immediate parents' $t$-values for redundancy checking.

The auxiliary functions used in the algorithm are defined in Figure 5.

Figure 2: Algorithm GenCands(PS_{l−1}, l) for generating l-candidate sets.

Input: potentially significant (l−1)-sets PS_{l−1}, size l
Output: l-candidates C_l, special parents Par
Method:
    C_l = ∅;
    next = 0; next2 = 0;
    for i = 0 to |PS_{l−1}| − 1
        for j = i + 1 to |PS_{l−1}| − 1
            if (commonmin(PS_{l−1}[i].set, PS_{l−1}[j].set) and
                (|PS_{l−1}[i].set ∩ PS_{l−1}[j].set| == l − 2))
                C_l[next].set = PS_{l−1}[i].set ∪ PS_{l−1}[j].set;
                // all parents with the same minimum
                // attribute have to be potentially significant
                if (∃Y ⊊ C_l[next].set ((|Y| == l − 1) and
                    commonmin(Y, C_l[next].set) and (Y ∉ PS_{l−1})))
                    skip;                    // candidate is not added to C_l
                else
                    X = C_l[next].set \ {minattr(C_l[next].set)};
                    if (X ∉ PS_{l−1})
                        Par[next2].set = X;  // add special parent
                        next2++;
                    next++;
    return (C_l, Par)

Figure 3: Algorithm PruneCands(C_l, l, K) for pruning l-candidate sets whose canonical supersets cannot be significant.

Input: l-candidates C_l, size l, significance level K
Output: potentially significant l-sets PS_l, special parents Par
Method:
    PS_l = ∅;
    // database pass:
    count frequencies C_l[i].fr and Par[j].fr from r;
    for i = 0 to |C_l| − 1
        p_min = min{P(A) | A ∈ C_l[i].set};
        if (otmaxt(|r|, C_l[i].fr, p_min) ≥ K)
            PS_l = PS_l ∪ C_l[i];
    return (PS_l, Par);
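The pruning test amounts to evaluating the upper bound for $t$ with $\gamma = 1/p_{\min}$ and comparing it with $K$. A minimal Python sketch under stated assumptions: frequencies are already counted and passed in as dictionaries, and the names `upper_bound_t`, `prune_cands`, `cand_freqs`, and `attr_freqs` are illustrative, not taken from the paper.

```python
import math


def t_stat(n, fr, gamma):
    """Equation 4."""
    return math.sqrt(n * fr) * (gamma - 1) / math.sqrt(gamma - fr)


def upper_bound_t(n, fr, p_min):
    """Upper bound for t in a branch: Equation 4 with gamma = 1 / p_min."""
    return math.sqrt(n * fr) * (1 - p_min) / math.sqrt(p_min * (1 - fr * p_min))


def prune_cands(cand_freqs, attr_freqs, n, K):
    """Keep the candidates whose supersets can still reach t >= K.

    cand_freqs: dict mapping a tuple of attributes to its relative frequency
    attr_freqs: dict mapping a single attribute to its relative frequency
    """
    kept = {}
    for attrs, fr in cand_freqs.items():
        p_min = min(attr_freqs[a] for a in attrs)
        if upper_bound_t(n, fr, p_min) >= K:
            kept[attrs] = fr
    return kept


if __name__ == "__main__":
    attr_freqs = {"A": 0.1, "B": 0.2, "C": 0.5}
    cand_freqs = {("A", "B"): 0.05, ("B", "C"): 0.001}
    # With n = 1000 and K = 2 only ("A", "B") remains potentially significant.
    print(prune_cands(cand_freqs, attr_freqs, n=1000, K=2.0))
```

Since the bound depends only on the set's own frequency and its least frequent attribute, one database pass per level is enough to decide which branches survive.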

Figure 4: Algorithm SelectRules(PS_l, PS_{l−1}) for selecting the best, minimal rules from PS_l.

Input: potentially significant l-sets PS_l and (l−1)-sets PS_{l−1}
Output: modified PS_l
Method:
    for i = 0 to |PS_l| − 1
        // calculate an upper bound for t in the branch
        maxγ = max{γ(PS_l[i].set \ {A} → A) | A ∈ PS_l[i].set};
        PS_l[i].cons = A;                       // consequent attaining maxγ
        PS_l[i].maxt = t(n, PS_l[i].fr, maxγ);
        if ∃j ((PS_{l−1}[j].set ⊊ PS_l[i].set) and
               (PS_{l−1}[j].maxt ≥ PS_l[i].maxt))
            PS_l[i].red = 1;                    // rule is redundant
            PS_l[i].maxt = PS_{l−1}[j].maxt;
    return PS_l

Figure 5: Functions commonmin, minattr, t, otmaxt, and childrenbetter.

    commonmin(X, Y)
        return (∃A ∈ X ∩ Y (∀B ∈ X ∪ Y (P(B) ≥ P(A))))
    minattr(X)
        return A ∈ X such that (∀B ∈ X (P(A) ≤ P(B)))
    t(n, fr, γ)
        return √(n·fr)·(γ − 1) / √(γ − fr);
    otmaxt(n, fr, p_min)
        return √(n·fr)·(1 − p_min) / √(p_min·(1 − fr·p_min))
    childrenbetter(l, ind)
        return (∀C_{l+1}[j] ((C_l[ind] ⊊ C_{l+1}[j]) ⇒
                (not (C_{l+1}[j].red) and (C_{l+1}[j].maxt > C_l[ind].maxt))))

4.2 Time complexity

It is known that the problem of searching all frequent attribute sets is NP-hard in terms of the number of attributes, $k$ [7]. The worst case happens when the most significant association rule involves all $k$ attributes, and all $2^k$ attribute sets are generated. The worst-case complexity of the algorithm is $O(\min\{k^2, n\}2^k)$.

Theorem 4. The worst-case time complexity of StatApriori is $O(\min\{k^2, n\}2^k)$, where $n$ is the number of rows and $k$ is the number of attributes.

Proof. The initialization (generation of 1-sets) takes $n \cdot k$ steps. Producing $l$-sets and their best rules takes $l^2|C_l| + 2n|C_l|\log k + 2|PS_l|\,l$ time steps.

The first term is the time complexity of the candidate generation. Each candidate has $l$ parents and each parent can be found in the trie in $l - 1$ steps.

The second term is the complexity of the pruning phase. The database is read ($n$ rows) and on each row $|C_l|$ candidates are checked. In the worst case, all candidates have an extra parent which has to be checked, too. Checking takes in the worst case $\log k$ steps, when the data is stored as bit vectors and the inclusion test is implemented with logical bit operations.

The third term is the complexity of the rule selection phase: for each of the $|PS_l|$ sets, all $l$ parents are checked. Checking is done at most twice: once for calculating the maximal $t$-value (selecting the best rule) and a second time for checking the redundancy. Each check can be implemented in constant time, if the parent pointers are stored in a temporary structure in the candidate generation phase.

Since $|C_l| \geq |PS_l|$, the total complexity is

\[ \sum_{l=2}^{k} \min\{l^2, n\}\,|C_l| < \min\{k^2, n\} \sum_{l=2}^{k} \binom{k}{l} = O(\min\{k^2, n\}2^k). \]

5. EXPERIMENTAL RESULTS

StatApriori was tested with real and simulated data sets which are often used for testing association rule discovery programs. All data sets were obtained from the FIMI repository for frequent itemset mining (http://fimi.cs.helsinki.fi/). Two of the data sets, mushroom and chess, were dense and known to be pathological cases for the traditional Apriori. The other two, T10I4D100K and T40I10D100K, were large but sparse sets, thus resembling typical market-basket data. In addition, we tried modified data sets (marked by an asterisk, *) which were produced from the original sets by randomly selecting half of the attributes. If a row became empty, it was also removed.

All experiments were executed on an Intel Core Duo processor T GHz with 1 GB RAM, 2 MB cache, and the Linux operating system. The results are given in Table 2.

The original idea was to compare StatApriori to traditional Apriori. However, it turned out that traditional Apriori ran out of memory (after a long execution) in most tests, because the minimum frequency threshold was so low. With higher thresholds, some significant rules could have been missed.

When Apriori was halted by the system, we have reported the last parameter values.

The only tests which Apriori passed successfully were with the modified mushroom data. With K = 90 StatApriori generated only 0.03% of the sets Apriori did, and with the other K value the share was likewise less than a promille. With the original mushroom data Apriori was halted at level 6, but still the number of potentially significant sets was five times the number StatApriori generated during the whole execution. It was also checked that all rules discovered by StatApriori were minimal significant rules, and all extra rules produced by Apriori were redundant. Therefore, StatApriori produced a smaller number of rules and the maximal length of rules was also smaller.

T10I4D100K and T40I10D100K were difficult even for StatApriori, even when the largest possible K-value was used. This was not a surprise, since the minimum frequency threshold was very low. StatApriori finished T10I4D100K in about 2 hours (the exact processor time was lost). T40I10D100K took over a night (intermediate results were lost) and produced rules with K = 315. With higher K values, some significant rules would have been lost.

We note that both the StatApriori and the traditional Apriori algorithms were implemented without any special optimizations. The discovered attribute sets were stored in a trie structure, but the data was stored in bit vectors without any indexing structure. Since the database pass is the main bottleneck, the performance could be improved with an efficient indexing structure.

To test the effect of indexing structures, we tried an Apriori implementation with a prefix tree for the data by C. Borgelt (FIMI repository, http://fimi.cs.helsinki.fi/src/). The prefix tree improved only the execution of those problems which were tractable anyway, but it could not solve any new problems. In addition, Borgelt's program lost some frequent (and potentially significant) 1-sets when the lowest frequency thresholds were used.
We also tested StatApriori with an on-line Bonferroni adjustment. In the beginning the initial $K$ (e.g. $K = 2$) is multiplied by the square root of the number of 1-sets, typically $\sqrt{k}$. The $K$ value is updated after every $l$-collection to reflect the current number of tests. The resulting search was very efficient, but it suffers from one problem: if the next $K$-value, $K_l$, is too high for accepting any rules, we can backtrack to the previous value $K_i$, $i < l$, which actually produced significant rules. However, it is possible that the number of rules which are potentially significant at level $K_i$ is so large that $K$ should be updated. Thus, we should iterate between $K_i$ and $K_l$ until we find a $K$ that is consistent with $|\{X \mid X \text{ is potentially significant at level } K\}|$.

6. CONCLUSIONS

In this paper, we have introduced a new algorithm for searching minimal, statistically significant association rules. As far as we know, this is the first algorithm for discovering all statistically significant association rules. The previous algorithms have solved the problem only for decision rules or used statistical measures which are not designed for association rules, thus producing type 1 or type 2 errors.

According to our experiments with real and simulated data, the algorithm is reasonably efficient. It searches only a small fraction of the search space explored by the traditional Apriori with a sufficiently small minimum frequency threshold. (In fact, if we allowed type 2 errors like other approaches, it would compete with the existing algorithms.) In addition, it produces only minimal association rules, without any further pruning step.

The main bottleneck is the same as with Apriori: the database pass. This can be improved with a suitable indexing structure. Such a structure will be developed in future research, ideally for any association rules, including negations.

7. REFERENCES

[1] C. Aggarwal and P. Yu. A new framework for itemset generation. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 1998), pages 18-24, New York, USA. ACM Press.
[2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In P. Buneman and S. Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C.
[3] Y. Bastide, N. Pasquier, R. Taouil, G. Stumme, and L. Lakhal. Mining minimal non-redundant association rules using frequent closed itemsets. In Proceedings of the First International Conference on Computational Logic (CL'00), volume 1861 of Lecture Notes in Computer Science, London, UK. Springer-Verlag.
[4] F. Berzal, I. Blanco, D. Sánchez, and M. A. V. Miranda. A new framework to assess association rules. In Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis (IDA'01), volume 2189 of Lecture Notes in Computer Science, London, UK. Springer-Verlag.
[5] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In J. Peckham, editor, Proceedings ACM SIGMOD International Conference on Management of Data. ACM Press.
[6] W. Hämäläinen. Assessing the statistical significance of association rules. Data Mining and Knowledge Discovery. Submitted.
[7] C. Jermaine. Finding the most interesting correlations in a database: how hard can it be? Information Systems, 30(1):21-46.
[8] B. Liu, W. Hsu, and Y. Ma. Pruning and summarizing the discovered associations. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'99), New York, USA. ACM Press.
[9] R. Meo. Theory of dependence values. ACM Transactions on Database Systems, 25(3).
[10] J. Milton and J. Arnold. Introduction to Probability and Statistics: Principles and Applications for Engineering and the Computing Sciences. McGraw-Hill, New York, 4th edition.
[11] S. Morishita and A. Nakaya. Parallel branch-and-bound graph search for correlated association rules. In M. Zaki and C.-T. Ho, editors, Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD, volume 1759 of Lecture Notes in Computer Science, London, UK. Springer-Verlag.
[12] S. Morishita and J. Sese. Traversing itemset lattices with statistical metric pruning. In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'00), New York, USA. ACM Press.
[13] G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro and W. Frawley, editors, Knowledge Discovery in Databases. AAAI/MIT Press.
[14] J. Shaffer. Multiple hypothesis testing. Annual Review of Psychology, 46.
[15] C. Silverstein, S. Brin, and R. Motwani. Beyond market baskets: Generalizing association rules to dependence rules. Data Mining and Knowledge Discovery, 2(1):39-68.
[16] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD Record, 25(2):1-12.
[17] P. Tan and V. Kumar. Interestingness measures for association patterns: A perspective. Technical Report TR00-036, Department of Computer Science, University of Minnesota.
[18] G. Webb. Discovering significant rules. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), New York, USA. ACM Press.
[19] G. I. Webb. Discovering significant patterns. Machine Learning, 68(1):1-33.
[20] X. Wu, C. Zhang, and S. Zhang. Efficient mining of both positive and negative association rules. ACM Transactions on Information Systems, 22(3).
[21] Y. Yao and N. Zhong. An analysis of quantitative measures associated with rules. In Proceedings of the Third Pacific-Asia Conference on Methodologies for Knowledge Discovery and Data Mining (PAKDD'99), London, UK. Springer-Verlag.

Table 2: Comparison of StatApriori and traditional Apriori with different data sets. The fields are the following: #sets gives the total number of potentially significant sets generated, #rules gives the number of significant rules produced, size the maximum number of attributes per rule, level the last level processed, and time the execution time in seconds. Most tests were intractable for the traditional Apriori; results before halting are reported.

Columns: Input (set (n, k); K, min_fr) | StatApriori: #sets, #rules, size, level, time | Apriori: #sets, #rules, size, level, time

Data set (n, k)                  K
mushroom (8124, 119)             87
mushroom                         90
mushroom*                        90
mushroom*                        80
chess* (3196, 74)                55
chess*                           56
T10I4D100K ( , 870)              315
T10I4D100K* (96 218, 298)        310
T40I10D100K ( , 942)             315    (time >8h)
T40I10D100K* ( , 470)            310
