Efficient search of association rules with Fisher's exact test


W. Hämäläinen
Department of Computer Science, University of Helsinki

Abstract

Elimination of spurious and redundant rules is an important problem in association rule discovery. Statistical measure functions like the chi-squared measure, the z-score and p-values offer a sound solution, but all of them are non-monotonic and computationally problematic. In addition, approximate measures like the chi-squared measure do not work well when relative frequencies are low. On the other hand, exact probabilities are hard to calculate for any larger data set. We develop an efficient solution to this tricky problem. Non-monotonicity is handled with a new algorithm, which also prunes redundant rules on-line. The quality of rules is estimated by Fisher's exact test. The initial experiments suggest that the method is both feasible and improves the quality of association rules remarkably.

1 Introduction

The majority of data mining research has concentrated on scalable search methods. The main goal has been to handle ever larger data sets, while the quality of the discovered patterns has often been forgotten. This is especially true of traditional association rule discovery. In the worst case all discovered rules can be spurious, while actually significant rules are missing [12, 11]. The problem is mainly computational, although there is debate on how the significance of association rules should be estimated and whether it is even possible. In any case, statistical measure functions like χ², the z-score and p-values can be used to select better rules. The problem is how to do this efficiently.

A straightforward way is to first search frequent item sets in the traditional manner and use the measure functions afterwards to select the best rules. The problem is that significance does not coincide with any absolute minimum frequency threshold. If the threshold is set too high, many significant rules are missed. On the other hand, a too low threshold produces a huge number of spurious and redundant rules, if the computation finishes at all. At the same time, the actual set of significant, non-redundant rules is relatively small. There is no need to traverse large search spaces, if one just knows where to search.

A natural alternative is to search the rules directly with the desired measure function. Unfortunately, none of these statistical measure functions is monotonic, i.e. a set of attributes can produce significant rules, even if its subset ('parent set') cannot. However, a property does not have to be monotonic in all respects to enable efficient search. In previous research [5], we have developed a new strategy for searching non-redundant significant association rules with a generic significance measure. The only requirement is that the measure function has a lower bound (or an upper bound) which behaves monotonically, i.e. either decreases or increases with the frequency and lift. We now extend the algorithm to both positive and negative rules of the forms X → A, X → ¬A, ¬X → A, and ¬X → ¬A. In practice, it is enough to search only positive correlations, but to allow a negation in the consequent: if rule X → A expresses a positive correlation, then X → ¬A expresses a negative correlation. Fisher's exact test is a natural choice for the measure function, but it, too, is tricky to handle. The calculation is time consuming and can easily cause an overflow when any larger data set is processed. As a solution, we derive good approximations for the upper and lower bounds of the p-value.
The third new contribution is an efficient strategy for frequency counting. The technique is not bound to this special problem, but can be used in other algorithms as well.

The idea of searching both positive and negative rules is not new, but only a few algorithms have been presented. In the approach by Wu et al. [13], negative rules were searched from a set of infrequent item sets. For pruning, they used a heuristic interestingness measure in addition to minimum frequency thresholds. Antonie and Zaïane [2] also used minimum frequency thresholds, but the strength of correlations was estimated by the Pearson correlation coefficient. According to the authors' analysis on one data set, the method produced interesting rules. The research by Koh and Pears [6, 7] is the most closely related to ours. They have also rejected absolute frequency thresholds and used Fisher's exact test as a significance criterion.

Their approach is a quite straightforward application of the standard Apriori and no special strategies are used. The quality of the discovered rules was not evaluated. Webb [12, 11] has also used Fisher's exact test to estimate the significance of association rules, but no new algorithm was introduced. According to empirical results, none of these methods is really scalable for large-scale data mining purposes. This is quite understandable, since the problem is also demanding. In this sense, our algorithm can be considered really advantageous. An extra benefit is that the algorithm is general and suits several measure functions.

The rest of the paper is organized as follows: the basic definitions are given in Section 2. In Section 3, we consider statistical significance measures and especially Fisher's exact test. The search algorithm and pruning strategies are introduced in Section 4. Experimental results are reported in Section 5 and the final conclusions are drawn in Section 6. The basic notations are introduced in Table 1.

2 Basic Definitions

In the following we give basic definitions of the association rule, statistical dependence, and redundancy.

2.1 Association Rules

Traditionally, association rules are defined in the frequency-confidence framework as follows:

Definition 1. (Association rule) Let R be a set of binary attributes and r a relation according to R. Let X ⊆ R, A ∈ R \ X, x ∈ Dom(X), and a ∈ Dom(A). The confidence of rule (X = x) → (A = a) is cf(X = x → A = a) = P(X = x, A = a)/P(X = x) = P(A = a | X = x), and the frequency of the rule is fr(X = x → A = a) = P(X = x, A = a). Given user-defined thresholds min_cf, min_fr ∈ [0, 1], rule (X = x) → (A = a) is an association rule in r, if (i) cf(X = x → A = a) ≥ min_cf, and (ii) fr(X = x → A = a) ≥ min_fr.

The first condition requires that an association rule should be strong enough, and the second condition requires that it should be common enough. Originally, it was assumed that the frequency corresponds to the statistical significance of the rule [1]. In this paper, we call rules association rules even if no thresholds min_fr and min_cf are specified. Usually, it is assumed that the rule contains only positive attribute values (A_i = 1). Then the rule can be expressed simply by listing the attributes, e.g. A1, A3, A5 → A2. Rules of the form ¬X → A, X → ¬A, or ¬X → ¬A are often called negative association rules, even if the negations can occur only in front of the whole antecedent or consequent.

2.2 Statistical Dependence

Statistical dependence ('correlation') is classically defined through statistical independence (e.g. [8]):

Definition 2. Let X ⊆ R and A ∈ R \ X be sets of binary attributes. Events X = x and A = a, x ∈ Dom(X), a ∈ Dom(A), are mutually independent, if P(X = x, A = a) = P(X = x)P(A = a). If the events are not independent, they are dependent.

The strength of the statistical dependence between (X = x) and (A = a) can be measured by the lift or interest:

γ(X = x, A = a) = P(X = x, A = a) / (P(X = x)P(A = a)).   (2.1)

2.3 Redundancy

A common goal in association rule discovery is to find the most general rules (containing the minimal number of attributes) which satisfy the given search criteria. There is no sense in outputting complex rules X → A, if their generalizations Z → A, Z ⊊ X, are at least equally interesting. Generally, the goal is to find minimal (or most general) interesting rules, and to prune out redundant rules [3]. In our case, the interestingness measure is the statistical significance, but in general, redundancy and minimality can be defined with respect to any other measure function.
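To make these quantities concrete, the following minimal sketch (not part of the paper; the function name and the example counts are illustrative) computes the frequency, confidence, and lift of a rule X → A from absolute counts:

```python
def rule_statistics(m_X, m_A, m_XA, n):
    """Return (frequency, confidence, lift) of rule X -> A.

    m_X  : number of rows where X holds
    m_A  : number of rows where A holds
    m_XA : number of rows where both X and A hold
    n    : total number of rows
    """
    p_X, p_A, p_XA = m_X / n, m_A / n, m_XA / n
    frequency = p_XA                    # fr(X -> A) = P(X, A)
    confidence = p_XA / p_X             # cf(X -> A) = P(A | X)
    lift = p_XA / (p_X * p_A)           # gamma(X, A); equals 1.0 under independence
    return frequency, confidence, lift

# Example: X holds on 50 of 1000 rows, A on 200, and both on 40.
print(rule_statistics(50, 200, 40, 1000))   # (0.04, 0.8, 4.0)
```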
Generally, redundancy can be defined as follows:

Definition 3. Given some interestingness measure M, rule X → A is redundant, if there exists a rule X′ → A′ such that X′ ∪ {A′} ⊊ X ∪ {A} and M(X′ → A′) ≥ M(X → A). If the rule is not redundant, then it is called non-redundant.

I.e. a rule is non-redundant, if all its generalizations ('parent rules') are less significant. It is still possible that some or all of its specializations ('children rules') are better. In the latter case, the rule is unlikely to be interesting itself. For negative association rules we extend the previous definition:

Definition 4. Let M be as before. Rule X → A (the consequent may also be negated) is redundant, if there exists a rule X′ → A′ such that X′ ∪ {A′} ⊊ X ∪ {A}, X′ ∪ {¬A′} ⊊ X ∪ {A}, X′ ∪ {A′} ⊊ X ∪ {¬A}, or X′ ∪ {¬A′} ⊊ X ∪ {¬A}, and M(X′ → A′) ≥ M(X → A).
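As an illustration of Definition 3 (a sketch under the assumption that the measure values of all evaluated rules are available in a lookup table; the function and variable names are hypothetical), a rule is redundant exactly when some rule built from a proper subset of its attributes is at least as good:

```python
from itertools import combinations

def is_redundant(X, A, M):
    """X: frozenset of antecedent attributes, A: consequent attribute,
    M: dict mapping (antecedent frozenset, consequent) -> measure value."""
    items = X | {A}
    for size in range(2, len(items)):            # proper subsets with a non-empty antecedent
        for subset in combinations(sorted(items), size):
            for cons in subset:                  # any attribute of the subset as consequent
                ante = frozenset(subset) - {cons}
                if M.get((ante, cons), float("-inf")) >= M[(X, A)]:
                    return True
    return False

M = {(frozenset({"A", "B"}), "C"): 5.0, (frozenset({"A"}), "C"): 6.0}
print(is_redundant(frozenset({"A", "B"}), "C", M))   # True: A -> C is at least as good
```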

Non-redundant rules can be further classified as minimal or non-minimal:

Definition 5. (Minimal rule) A non-redundant rule X → A is minimal, if for all rules X′ → A′ such that X′ ∪ {A′} ⊋ X ∪ {A}, M(X → A) ≥ M(X′ → A′).

I.e. a minimal rule is more significant than any of its parent or children rules. On the algorithmic level this means that we can stop the search without checking any children rules, once we have ensured that the rule is minimal. Once again, we can allow a negation in front of A or A′.

3 Measuring the statistical significance

The basic idea of statistical significance testing is to estimate the probability of the observed or a more extreme event, under some null hypothesis. In the case of association rules, we have observed some frequency m(X, A) and lift γ(X, A) (or, equivalently, the confidence, when m(A) is known). The null hypothesis is that rule X → A is spurious, i.e. that X and A are actually independent. If we want to measure the significance of a positive correlation, we calculate the probability of the observed or a larger m(X, A), under the independence hypothesis. If the probability is very small, we can assume that the observation is not just due to chance, but that X → A represents a genuine pattern. So far everything should be clear. However, there are at least three ways to estimate the probability of the observed or a rarer event under the independence assumption. All of them can be correct, depending on the purpose and situation.

3.1 Binomial probability, version 1

The simplest approach is to assume that the actual P(X) and P(A) are the same as observed, and to estimate the probability that X and A occur together at least m(XA) times in the given data set (a sample of size n), assuming that they are actually independent. This means that m(XA) can be at most min{m(X), m(A)}. Now it is enough to consider the rows where X is true. On each such row, A occurs with probability P(A), assuming independence, and the probability of observing A at least m(XA) times in m(X) rows is
$$p_{bin1} = \sum_{i=m(X,A)}^{m(X)} \binom{m(X)}{i} P(A)^i (1-P(A))^{m(X)-i}.$$
The probability can be estimated by the normal distribution, when we calculate the z-score
$$z(X \rightarrow A) = \frac{\sqrt{n}\,(P(XA) - P(X)P(A))}{\sqrt{P(X)P(A)(1-P(A))}}.$$
This is closely related to the χ²-score, which is χ²(X → A) = z²(X → A) + z²(¬X → A). The problem of this measure is that rules in the same data set cannot be compared: each rule is tested in a different part of the data, where the antecedent X holds. According to our experiments, the results are very poor, if we use one absolute minimum threshold for all rules. As a solution, we can try to normalize the measure and divide it by its maximum value $\frac{\sqrt{n}\,P(X)(1-P(A))}{\sqrt{P(X)P(A)(1-P(A))}}$. The result is $\frac{P(A|X) - P(A)}{1 - P(A)}$, which happens to be Shortliffe's certainty factor [10], when γ > 1. The certainty factor can produce very accurate results when used for prediction purposes (e.g. [4]). However, it is very sensitive to redundant rules, because all rules with confidence 1.0 gain the maximum score, even if they occur on just one row. In addition, the certainty factor does not suit branch-and-bound style search, because the upper bound is always the same.
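For illustration, a minimal sketch (not the paper's implementation; the names and example counts are hypothetical) of the version-1 binomial tail probability and the corresponding z-score:

```python
import math

def p_bin1(m_X, m_A, m_XA, n):
    """P(at least m_XA successes in m_X trials with success probability P(A))."""
    p_A = m_A / n
    return sum(math.comb(m_X, i) * p_A**i * (1 - p_A)**(m_X - i)
               for i in range(m_XA, m_X + 1))

def z_version1(m_X, m_A, m_XA, n):
    """Normal approximation of the same tail probability."""
    p_X, p_A, p_XA = m_X / n, m_A / n, m_XA / n
    return math.sqrt(n) * (p_XA - p_X * p_A) / math.sqrt(p_X * p_A * (1 - p_A))

# Example: n = 1000, m(X) = 50, m(A) = 200, m(XA) = 40.
print(p_bin1(50, 200, 40, 1000))       # a very small tail probability
print(z_version1(50, 200, 40, 1000))   # a large positive z-score (about 10.6)
```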
3.2 Binomial probability, version 2

There is another, better way to use the binomial probability for assessing association rules. In this approach, we consider the whole data set and assume that on each row the combination XA occurs with probability P(X)P(A), if X and A are independent. Now we assume that P(X)P(A) is correct under the independence assumption, but no assumptions are made on P(X) and P(A) themselves. Once again, the frequency of XA follows a binomial distribution:
$$p_{bin2} = \sum_{i=m(X,A)}^{n} \binom{n}{i} (P(X)P(A))^i (1 - P(X)P(A))^{n-i}.$$
The corresponding z-score is
$$z(X \rightarrow A) = \frac{\sqrt{n}\,(P(XA) - P(X)P(A))}{\sqrt{P(X)P(A)(1 - P(X)P(A))}}.$$
Now all rules in one data set are comparable, and we can search all rules at a desired significance level (minimum z value). According to our experiments [5], the results are very good compared to other measures. However, the z-score overestimates the significance when the frequencies become low, which can both lead to suboptimal results and lengthen the search. As a result, we have used upper and lower bounds of the exact binomial probability to cut the search in time. Unfortunately, the binomial probability itself is not a monotonic function of P(A) and does not suit our search algorithm as such.

3.3 Fisher's exact test

A third alternative, used in Fisher's exact test, is to calculate the probability of the whole distribution of X and A. This approach is also taken in [11] for assessing the significance of association rules. When we estimate the significance of a positive correlation between X and A (or, equivalently, between ¬X and ¬A), the probability of the observed or a stronger correlation is
$$p = \sum_{i=0}^{m} \frac{m(X)!\, m(\neg X)!\, m(A)!\, m(\neg A)!}{n!\,(m(X,A)+i)!\,(m(X,\neg A)-i)!\,(m(\neg X,A)-i)!\,(m(\neg X,\neg A)+i)!},$$
where m = min{m(X, ¬A), m(¬X, A)}. This is the same as the significance of a negative correlation between X and ¬A, or between ¬X and A. Thus, we can easily search both positive and negative correlations with the same measure, by testing the significance of the positive correlations between X and A and between X and ¬A.
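The sum above can be evaluated in the log domain with log-factorials, which keeps the factorials from overflowing. The following sketch (not the paper's implementation; the example table is hypothetical) computes the one-sided p-value for positive dependence from the four cell counts of the 2×2 contingency table:

```python
from math import lgamma, exp

def log_factorial(k):
    return lgamma(k + 1)                 # ln(k!)

def fisher_p_positive(m_XA, m_XnA, m_nXA, m_nXnA):
    """One-sided Fisher p-value for positive dependence between X and A."""
    m_X, m_nX = m_XA + m_XnA, m_nXA + m_nXnA
    m_A, m_nA = m_XA + m_nXA, m_XnA + m_nXnA
    n = m_X + m_nX
    # ln of the common numerator of every term in Fisher's sum
    log_num = (log_factorial(m_X) + log_factorial(m_nX)
               + log_factorial(m_A) + log_factorial(m_nA) - log_factorial(n))
    p = 0.0
    for i in range(min(m_XnA, m_nXA) + 1):       # the observed and all more extreme tables
        log_term = log_num - (log_factorial(m_XA + i) + log_factorial(m_XnA - i)
                              + log_factorial(m_nXA - i) + log_factorial(m_nXnA + i))
        p += exp(log_term)
    return p

# Example table: m(XA) = 40, m(X, not A) = 10, m(not X, A) = 160, m(not X, not A) = 790.
print(fisher_p_positive(40, 10, 160, 790))
```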

The factorials in the equation can be problematic (causing an overflow) when the data set is at all large, but for the search algorithm it is sufficient to have good upper and lower bounds. In addition, we can take the logarithm of p, which is often easier to calculate.

3.4 Lower bound for the Fisher probability

A lower bound for p(X → A) is achieved by considering the best possible rule which can be derived from X. A tight lower bound can be derived from the Stirling inequality
$$n^n \sqrt{2\pi n}\; e^{Q_1} e^{-n} < n! < n^n \sqrt{2\pi n}\; e^{Q_2} e^{-n},$$
where Q₁ = (12n + 1)⁻¹ and Q₂ = (12n)⁻¹. Let A be the minimum attribute which can be added to X, i.e. P(A) ≤ P(B = b) or P(¬A) ≤ P(B = b) for all B ∈ Y, b ∈ {0, 1}, and for all Y ⊇ X. The best p-value is achieved when both the frequency and the lift are maximal, i.e. m(X, A) = m(X) and γ = min{P(A), P(¬A)}⁻¹. There are two cases:

1. If m(X) = m(A), then
$$p = \frac{m(A)!\, m(\neg A)!}{n!} \ge P(A)^{m(A)+0.5}\, P(\neg A)^{m(\neg A)+0.5}\, \sqrt{2\pi n}\; e^{Q},$$
where Q > 0. Therefore, we can omit e^Q.

2. If m(X) < m(A), then
$$p = \frac{m(A)!\, m(\neg X)!}{n!\,(m(A)-m(X))!} \ge \frac{P(\neg X)^{m(\neg X)+0.5}\, P(A)^{m(A)+0.5}}{(P(A)-P(X))^{m(A)-m(X)+0.5}}\; e^{Q}.$$
Once again Q > 0.

The first lower bound is an increasing function of P(A), when P(A) < 0.5. Therefore, it can be used to derive a minimum frequency threshold: min_fr = max{P(A) | p(bestrule(A)) > max_p}.

3.5 Upper bound for the Fisher probability

An upper bound for p is achieved by multiplying an upper bound of the first term of Fisher's sum by the number of terms. In the best case, when P(XA) = P(X) or P(XA) = P(A), there is just one term. Otherwise, the number of terms is tnum = min{m(X¬A), m(¬XA)} + 1. The first term can be estimated quite accurately using Gosper's approximation
$$n! \approx n^n \sqrt{\frac{(6n+1)\pi}{3}}\; e^{-n}.$$
Now we can derive the following upper bounds:

1. When m(XA) < m(X), m(XA) < m(A), and m(¬X¬A) > 0,
$$p(X \to A) \le tnum \left(\frac{P(X)P(A)}{P(XA)}\right)^{m(XA)} \left(\frac{P(\neg X)P(A)}{P(\neg XA)}\right)^{m(\neg XA)} \left(\frac{P(X)P(\neg A)}{P(X\neg A)}\right)^{m(X\neg A)} \left(\frac{P(\neg X)P(\neg A)}{P(\neg X\neg A)}\right)^{m(\neg X\neg A)} \sqrt{\frac{(6m(X)+1)(6m(\neg X)+1)(6m(A)+1)(6m(\neg A)+1)\cdot 3}{(6n+1)(6m(XA)+1)(6m(X\neg A)+1)(6m(\neg XA)+1)(6m(\neg X\neg A)+1)\cdot \pi}}.$$

2. When m(XA) < m(X), m(XA) < m(A), and m(¬X¬A) = 0, we know that m(X¬A) = m(¬A) and m(¬XA) = m(¬X). Then
$$p(X \to A) \le m(\neg A) \left(\frac{P(X)}{P(XA)}\right)^{m(XA)} P(X)^{m(\neg A)}\, P(A)^{m(A)}.$$

3. When m(XA) = m(X) = m(A),
$$p(X \to A) \le P(A)^{m(A)}\, P(\neg A)^{m(\neg A)} \sqrt{\frac{(6m(A)+1)(6m(\neg A)+1)\,\pi}{(6n+1)\cdot 3}}.$$

4. When m(XA) = m(X) < m(A),
$$p(X \to A) \le \frac{P(A)^{m(A)}\, P(\neg X)^{m(\neg X)}}{(P(A)-P(X))^{m(A)-m(X)}} \sqrt{\frac{(6m(\neg X)+1)(6m(A)+1)}{(6m(A)-6m(X)+1)(6n+1)}}.$$

In practice, we can group the equations into several parts by taking the logarithm of p. This also helps to avoid the overflow.
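The Stirling- and Gosper-based expressions above are what the search actually uses. Purely as an illustration of the bounding idea, and not as the paper's formulas, the following sketch exploits the fact that for a positively correlated table (γ > 1) the observed table contributes the largest term of Fisher's sum, so ln p lies between the logarithm of the first term and that logarithm plus ln(tnum):

```python
from math import lgamma, log

def log_factorial(k):
    return lgamma(k + 1)

def log_first_term(m_XA, m_XnA, m_nXA, m_nXnA):
    """ln of the hypergeometric probability of the observed table itself."""
    m_X, m_nX = m_XA + m_XnA, m_nXA + m_nXnA
    m_A, m_nA = m_XA + m_nXA, m_XnA + m_nXnA
    n = m_X + m_nX
    return (log_factorial(m_X) + log_factorial(m_nX)
            + log_factorial(m_A) + log_factorial(m_nA) - log_factorial(n)
            - log_factorial(m_XA) - log_factorial(m_XnA)
            - log_factorial(m_nXA) - log_factorial(m_nXnA))

def log_p_bounds(m_XA, m_XnA, m_nXA, m_nXnA):
    """Bounds on ln p for a table with positive dependence (gamma > 1)."""
    log_p0 = log_first_term(m_XA, m_XnA, m_nXA, m_nXnA)
    tnum = min(m_XnA, m_nXA) + 1         # number of terms in Fisher's sum
    return log_p0, log_p0 + log(tnum)    # lower and upper bound of ln p

print(log_p_bounds(40, 10, 160, 790))
```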
4 Algorithm

FishApriori implements a search algorithm for association rules containing both positive and negative correlations between positive attribute values. The main principle (a branch-and-bound search with a special enumeration tree) has been adapted from the StatApriori algorithm [5], which searches only significant positive correlations between positive attribute values. For FishApriori, we have developed a new strategy for handling negative correlations. In addition, we have improved the overall efficiency with a new, effective technique for frequency counting.

4.1 The main principles

The main pruning principle is based on the following observation: if measure I is an increasing function of lift γ(X, A), its upper bound is a decreasing function of P(A), and vice versa, when the frequency P(XA) is fixed. This holds for several common measure functions, as stated in the classical axioms by Piatetsky-Shapiro [9]. In the following, we will assume that the measure function increases with significance, although the probability behaves in the contrary manner. (One can imagine that the actual measure function is the inverse, p⁻¹.) Now the significance is a monotonic property as long as the sets contain the same minimum attribute, Mina(X) = arg min{P(A) | A ∈ X}.

Theorem 4.1. Let f_M(fr, γ) be an increasing function of fr and γ, and let M(X → A) = f_M(P(XA), γ(X, A)). Then for any attribute sets Y, X ⊆ Y, we have (i) UB(Bestrule(X)) = f_M(P(X), P(Mina(X))⁻¹), and (ii) if Mina(X) = Mina(Y), then UB(Bestrule(Y)) ≤ f_M(P(X), P(Mina(X))⁻¹).

Proof. Follows from the observation γ(X → A) ≤ P(A)⁻¹ ≤ (min{P(A) | A ∈ X})⁻¹.

Now it is enough to generate the attribute sets in such an order that the children sets (more special sets) have the same minimum attribute as their parents (more general sets). If the sets are generated in the correct order, each (l + 1)-set has at most one parent set with a different minimum attribute. In practice this means that the property PS, 'potentially significant', behaves monotonically in all other parent relations. This can be achieved when the new attribute sets are generated according to an ordered enumeration tree.

Figure 1: A complete enumeration tree (dashed lines) and an example tree (solid lines).

Figure 1 shows an example of an ordered enumeration tree, when R = {A, B, C, D, E} and P(A) ≤ P(B) ≤ ... ≤ P(E). Solid lines represent an example tree, when two levels are generated, and dashed lines show the nodes missing from the complete tree. The missing nodes at level 2 are non-PS, and it is enough to generate new children sets from the existing PS nodes. Negative values in the consequents and the new frequency counting method complicate the algorithm, but the same idea can still be applied. Now the attributes are arranged in ascending order by min{P(A), P(¬A)}. The nodes are classified into three categories according to their significance:

1. Set X is PS, 'potentially significant', if significant rules can be derived from X. This is true, if UB(I(bestrule(X))) = i(P(X), P(Mina(X))⁻¹) ≥ min_I.

2. Otherwise set X is non-PS. Such a set can still be a parent of a PS set in the previously mentioned special case, but no children are generated from it.

(a) Set X is absolutely non-PS, if none of its children can be significant, i.e. i(P(X), minp⁻¹) < min_I, where minp = min{P(A), P(¬A) | A ∈ R}. Such nodes can be removed from the tree immediately.

(b) Set X is non-PS, but not absolutely non-PS, if i(P(X), P(Mina(X))⁻¹) < min_I but i(P(X), minp⁻¹) ≥ min_I. This means that X can still be a parent of a PS set, and it is saved until the next level is generated.

4.2 Frequency counting

Frequency counting is the main bottleneck of association rule discovery algorithms. It is especially time consuming in the breadth-first search (which could otherwise be faster), because all l-candidates are stored and checked at once. This is also space consuming, which becomes the final burden when there is no more memory left. In the depth-first search, checking is usually done in smaller batches, even one candidate at a time, using some additional data structure. Constructing and storing the data structure takes time and space, too, but when the data set is dense, this strategy is usually more efficient.

Our new approach combines the benefits of both strategies. The attribute sets are generated in a breadth-first manner, but the frequencies are checked immediately. If a candidate is useless, it can be deleted immediately. Normally, it would be hopeless to check candidates one by one from the data, but the new data structure changes the situation. The data is stored into a reversed bitmap, where rows and columns are transposed: instead of n rows of k-bit vectors we have k rows of n-bit vectors. Each attribute A is implemented as a bit vector which has bit 1 in the ith position, if A occurs on the ith row. The trick is that now we can count any frequencies by simple bit-and operations. The frequency of set X is the number of 1-bits in the bit-and vector of the vectors of the attributes A ∈ X. Counting the frequencies of all l-candidates in collection C_l takes at most |C_l| · l · n time instead of |C_l| · k · n time. In practice, there are several ways to implement the bit-and operation and the bit counting more efficiently.
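A minimal sketch (not the paper's implementation) of the reversed bitmap: each attribute is stored as a single n-bit vector, here a Python integer, and the frequency of an attribute set is the population count of the bitwise AND of its vectors:

```python
def build_bitmap(rows, attributes):
    """rows: list of sets of the attributes present on each row."""
    bitmap = {a: 0 for a in attributes}
    for i, row in enumerate(rows):
        for a in row:
            bitmap[a] |= 1 << i          # set bit i if the attribute occurs on row i
    return bitmap

def frequency(itemset, bitmap):
    """Number of rows containing every attribute of a non-empty itemset."""
    it = iter(itemset)
    vec = bitmap[next(it)]
    for a in it:
        vec &= bitmap[a]                 # bit-and of the attribute vectors
    return bin(vec).count("1")           # population count of the result

rows = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
bm = build_bitmap(rows, ["A", "B", "C"])
print(frequency({"A", "B"}, bm))         # 2 (rows 0 and 3)
```

With one machine word per block of rows and a hardware population-count instruction, the same idea can be made considerably faster than this illustration.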
4.3 Pseudocode

The pseudocode of the FishApriori algorithm is given in Algorithms 4.1 and 4.2.

Algorithm 4.1. Algorithm FishApriori(R, r, max_p)
Input: set of attributes R, data set r, threshold max_p
Output: minimal, significant rules
Method:
1  determine min_fr from max_p
2  for all A ∈ R: if (P(A) < min_fr) R = R \ {A}
3  k = |R|; l = 1
4  arrange attributes A ∈ R such that (P(A_i) ≤ min{P(A_{i+1}), P(¬A_{i+1})}) or (min_fr ≤ P(¬A_i) ≤ min{P(A_{i+1}), P(¬A_{i+1})})
5  for i = 0 to i < k: add A_i to C_l
6  while (|C_l| ≥ l + 1)
7    for all X_i, X_j ∈ C_l
8      Y = GenCand(X_i, X_j)
       // is Y absolutely non-PS or redundant?
9      if ((LB(p(Y, A_min)) ≤ max_p) and (LB(p(Y, A_min)) < max{Z.p | Z ⊊ Y}))
10       C_{l+1} = C_{l+1} ∪ {Y}
11       count P(Y)
         // is Y non-PS or redundant?
12       if (LB(p(Y, Mina(Y))) > max_p)
13         Y.PS = 0
14       if (LB(p(Y, Mina(Y))) ≥ max{Z.p | Z ⊊ Y})
15         Y.Red = 1; Y.p = max{Z.p}
16       else calculate Y.p
17       mark if Y is minimal
18 output minimal, significant rules

Algorithm 4.2. Algorithm GenCand(X_1, X_2)
Input: potentially significant l-sets X_1 and X_2
Output: (l + 1)-candidate Y
Method:
1  if (PS(X_1) and (not Minimal(X_1)) and ((PS(X_2)) or (l == 2)) and (not Minimal(X_2)))
2    Y = X_1 ∪ X_2
3    if for all Z ⊊ Y: ((Z ∈ C_l) and ((PS(Z)) or (Mina(Z) ≠ Mina(Y))) and (not Minimal(Z)))
4      return Y
5  else return NULL

4.4 Example

Let us take an example. An example data set is given in Table 2. The order of attributes is defined by P(A) = 0.05 < P(B) = 0.10 = P(D) < P(C) = 0.20 < P(E). This means that in the first branch, under label A, the maximum possible lift is P(A)⁻¹ = 20, in the second branch P(B)⁻¹ = 10, in the third branch P(D)⁻¹ = 10, etc. The logarithm of the maximum p is −1. On the second level, all sets except one have been potentially significant. Figure 2 shows the enumeration tree, when the third level is beginning. Solid-line nodes have been added to the tree and the frequencies are given in the nodes.

Table 2: Example data (the counts m of each attribute combination).

Figure 2: An example enumeration tree, when the third level is processed.

First, the algorithm generates candidate AB from AB and A. The third parent, B, is also in the tree, which means that the candidate can be PS. The frequency of AB is counted and the best rule (with the highest lift) is generated. The rule is B → A. Its lower bound of ln(p) is below the parent's ln(p) of −7.6, and the rule is potentially non-redundant. The second set, ABC, is also PS. However, its best rule, C → A, has a higher ln(p) (−7.4) than its parent BC (−9.6), and the set is marked as redundant. The parent's ln(p) is copied to the node. The next candidates, AB and AC, are PS but not significant. Set A produces a non-redundant, significant rule with ln(p) = −2.1. Set AC is not generated, because one of its parents (containing C) is absolutely non-PS (and thus missing from the tree). The next set is BC, and its best rule has lower bound ln(p) = −7.4. However, this is above BC's ln(p), and the rule is redundant.

B is PS but not significant. The last sets, BC and B, cannot be significant, because their parent sets containing C are absolutely non-PS. On the fourth level, no more rules are generated. The program outputs five non-redundant, significant rules, given in Table 3. For comparison, we have calculated also the exact ln(p) values, in addition to the estimated bounds. With larger data sets the error is much smaller.

Table 3: Rules derived from the example data (columns: rule, fr, cf, γ, ln(UB(p)), ln(p)). The last column gives the actual ln(p); ln(UB(p)) is the estimated bound.

5 Experiments

FishApriori was tested with several large data sets and the results were compared to the traditional Apriori, using the χ²-measure in the post-processing phase. The Apriori program was an efficient implementation with a prefix tree by C. Borgelt (FIMI repository). It produced only positive association rules, and therefore we selected only positive correlations with the highest χ²-values. Redundancy reduction and rule ranking were implemented in a separate program.

The data sets and test parameters are described in Table 4. All data sets were tested with two minimum confidence thresholds, 0.6 and 0.9. The goal was to test both the prediction accuracy (accuracy of strong rules) and robustness (genuineness of correlations). In the first case, the natural measure is the average test error. In the latter case, the prediction error can be large (if the method is robust, the expected error is 0.4), but the correlations should be at least as strong in the test set as in the learning set. To measure the robustness, we calculated the z-score for the difference of the lift distribution (characterized by the mean and variance) between the learning set and the test set:
$$z(\Delta\gamma) = \frac{\mu(\gamma_L) - \mu(\gamma_T)}{\sqrt{\frac{\sigma^2(\gamma_L)}{n_L} + \frac{\sigma^2(\gamma_T)}{n_T}}},$$
where γ_L is the lift variable in the learning set and γ_T in the test set, μ and σ² denote the mean and variance, n_L is the number of rules considered in the learning set, and n_T is the number of these rules which could be applied in the test set. The closer the z-score is to zero, the more similar the distributions are. If the score is negative, then the correlations are stronger in the test set than in the learning set.
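As an illustration (not the paper's code; the example lift values are hypothetical), the robustness score can be computed directly from the lift values of the rules in the learning and test sets:

```python
from statistics import mean, pvariance
from math import sqrt

def lift_difference_z(lifts_learning, lifts_test):
    """Two-sample z-score for the difference of the mean lift."""
    n_L, n_T = len(lifts_learning), len(lifts_test)
    num = mean(lifts_learning) - mean(lifts_test)
    den = sqrt(pvariance(lifts_learning) / n_L + pvariance(lifts_test) / n_T)
    return num / den

# Hypothetical lifts of the same rules, measured in the learning and test sets.
print(lift_difference_z([2.1, 1.8, 3.0, 2.4], [2.0, 1.9, 2.8, 2.5]))
```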

For example, Figures 3 and 4 show the lift distribution among the 100 best rules in a learning set and a test set of the data set T40I10100K. The distributions are very similar, which hints that the same correlations hold in both data sets. The z-score was also very small.

Figure 3: The lift distribution among the 100 best rules in a learning set of T40I10100K.

Figure 4: The lift distribution among the 100 best rules in a test set of T40I10100K.

The first four data sets are classical benchmark data sets from the FIMI repository for frequent item set mining. Plants lists all plant species growing in the U.S.A. and Canada. Each row contains a species and its distribution (the states where it grows). The data has been extracted from the USA plants database. Garden lists recommended plant combinations. The data is extracted from several gardening sources. All experiments were executed on an Intel Core Duo processor with 1 GB RAM and 2 MB cache, under a Linux operating system. The average execution times are given in Table 4.

The quality of the rules is summarized in Table 5. For brevity, we report only the 100 best rules. For each test we give the average lift, the z-score of the difference of the lift between the learning and test sets, the prediction error in the test set, the average frequency, confidence, and rule length (number of attributes in the antecedent), and, for FishApriori, the number of rules with a negative consequent. All tests were repeated ten times with different learning and test sets, each time with 2/3 of the data in the learning set.

Both measures behaved quite robustly, as expected, since both of them have been designed for measuring statistical dependencies. In previous research we have found that this is not necessarily the case with other measure functions.

With Mushroom, FishApriori produced a clearly smaller prediction error. The lift was also slightly higher. Mushroom is generally quite an easy set for finding accurate rules.

Chess was the only set where FishApriori found negative rules. When the confidence threshold was 0.6, it found 27 negative rules, but with confidence 0.9, just one. In the first case the prediction accuracy was quite poor. In the latter case, the prediction accuracy was already good, slightly better than with the traditional Apriori. We note that the traditional Apriori required an extremely high minimum frequency threshold, 0.75, to be feasible. Partly this was due to the extra post-processing (redundancy reduction and ranking), but even without post-processing, Apriori could not be run with min_fr < 0.5. Another interesting observation is the lift, which was 1.0 for Apriori. In fact, the strongest and most frequent rules in Chess are between independent events. Therefore, the correlation measures tend to select infrequent and weaker rules.

T10I4100K and T40I10100K are quite similar sets, reminiscent of real market-basket data. In T40I10100K, the transactions are larger and the computation is heavier. In T10I4100K, the traditional Apriori with χ² produced a smaller error with the lower confidence, but with the higher confidence, the errors were the same. With T40I10100K, the traditional Apriori performed clearly better. The prediction errors were smaller and also the lift was higher. Once again, FishApriori had problems finding enough significant rules, and the significance threshold was loosened.

With Plants and Garden, FishApriori outperformed Apriori. The prediction errors were remarkably smaller, especially with Garden. Garden is an especially difficult set, because all patterns (plant combinations involving certain varieties, colours, leaf types etc.) are very rare. Still, FishApriori was able to select the genuine patterns among all the rare combinations. The traditional Apriori produced a clearly higher lift. This is quite natural, because the lift favours rare items in the consequent, and the resulting rules can be too infrequent. With these data sets, FishApriori tested several potentially significant negative rules, but none of them belonged to the 100 best rules.

6 Conclusions

We have considered the problem of discovering both positive and negative association rules efficiently. At the same time, the main objective has been the quality of the discovered rules. As a solution, we have applied Fisher's exact test, and pruned redundant rules on-line.
The resulting FishApriori algorithm extends our previous algorithm, but several new results were needed before the main idea could be applied to negative rules. In addition, we had to derive computationally feasible upper and lower bounds for Fisher's p-value.

The algorithm was tested carefully with several large data sets. The results were really encouraging. With most data sets, including computationally difficult dense data sets, the execution time was less than a second. This is especially remarkable, because no minimum frequency thresholds were used and the p-value threshold was set high enough so that at least 100 rules could be found. The efficiency was partly due to a new frequency counting method, which can be used to accelerate other algorithms as well. The rule quality was also very good, both in terms of prediction accuracy and robustness of correlations. However, significant negative correlations were relatively rare in the tested data sets.

In future research, our next goal is to tackle general association rules which can contain negations anywhere in the antecedent or consequent. The bit operations in the frequency counting could also be accelerated further. The Garden data set motivates developing techniques for searching significant rules among different levels of attribute hierarchies.

References

[1] R. Agrawal, T. Imielinski, and A.N. Swami. Mining association rules between sets of items in large databases. In P. Buneman and S. Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., 1993.

[2] M.-L. Antonie and O. R. Zaïane. Mining positive and negative association rules: an approach for confined rules. In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'04), pages 27-38. Springer-Verlag, 2004.

[3] Y. Bastide, N. Pasquier, R. Taouil, G. Stumme, and L. Lakhal. Mining minimal non-redundant association rules using frequent closed itemsets. In Proceedings of the First International Conference on Computational Logic (CL 2000), volume 1861 of Lecture Notes in Computer Science. Springer-Verlag, 2000.

[4] F. Berzal, I. Blanco, D. Sánchez, and M. Amparo Vila Miranda. A new framework to assess association rules. In Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis (IDA 2001). Springer-Verlag, 2001.

[5] W. Hämäläinen and M. Nykänen. Efficient discovery of statistically significant association rules. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008). To appear.

[6] Y. S. Koh. Mining non-coincidental rules without a user defined support threshold. In Advances in Knowledge Discovery and Data Mining, Proceedings of the 12th Pacific-Asia Conference (PAKDD 2008), volume 5012 of Lecture Notes in Computer Science. Springer, 2008.

[7] Y. S. Koh and R. Pears. Efficiently finding negative association rules without support threshold. In AI 2007: Advances in Artificial Intelligence, Proceedings of the 20th Australian Joint Conference on Artificial Intelligence, volume 4830 of Lecture Notes in Computer Science. Springer, 2007.

[8] R. Meo. Theory of dependence values. ACM Transactions on Database Systems, 25(3), 2000.

[9] G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro and W.J. Frawley, editors, Knowledge Discovery in Databases. AAAI/MIT Press, 1991.

[10] E.H. Shortliffe and B.G. Buchanan. A model of inexact reasoning in medicine. Mathematical Biosciences, 23, 1975.

[11] G. I. Webb. Discovering significant patterns. Machine Learning, 68(1):1-33, 2007.

[12] G. I. Webb. Discovering significant rules. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06). ACM Press, 2006.

[13] X. Wu, C. Zhang, and S. Zhang. Efficient mining of both positive and negative association rules. ACM Transactions on Information Systems, 22(3), 2004.

Table 1: Basic notations.

A, B, C, ... — binary attributes
a, b, c, ... ∈ {0, 1} — attribute values
R = {A_1, ..., A_k} — set of all attributes
|R| = k — number of attributes in R
Dom(R) = {0, 1}^k — attribute space
X, Y, Z ⊆ R — attribute sets
Dom(X) = {0, 1}^l ⊆ Dom(R) — domain of X, |X| = l
(X = x) = {(A_1 = a_1), ..., (A_l = a_l)} — event, |X| = l
t = {A_1 = t(A_1), ..., A_k = t(A_k)} — row
r = {t_1, ..., t_n | t_i ∈ Dom(R)} — relation (data set)
|r| = n — size of relation r
σ_{X=x}(r) = {t ∈ r | t[X] = x} — set of rows where X = x
m(X = x) = |σ_{X=x}(r)| — absolute frequency of X = x
P(X = x) = m(X = x)/n — relative frequency of X = x
(X = x) → (A = a) — association rule
X → A, X → ¬A — association rules containing negations only in the consequent
P(X → A) = P(XA) — frequency of the rule
P(A | X) = P(XA)/P(X) — confidence of the rule
γ(X, A) = P(XA)/(P(X)P(A)) — lift of the rule
UB(M) — an upper bound for M
LB(M) — a lower bound for M
Bestrule(X) = arg max_{A ∈ X} {M(X \ A → A)} — the best rule which can be derived from X
Mina(X) = arg min{P(A) | A ∈ X} — the least frequent attribute in X
PS(X) — is X potentially significant? PS(X) = 1, if UB(M(Bestrule(X))) ≥ min_M
Red(X) — is X redundant? Red(X) = 1, if there is Z ⊊ X with M(Bestrule(Z)) ≥ M(Bestrule(X))
Minimal(X) — is X minimal? Minimal(X) = 1, if for all Y ⊋ X, M(Bestrule(Y)) ≤ M(Bestrule(X))
Significant(X) — is X significant? Significant(X) = 1, if M(Bestrule(X)) ≥ min_M

Table 4: Data sets, test parameters, and average execution times in seconds. The natural logarithm of the maximum p-value, max_lnp, was used in FishApriori and the minimum frequency threshold, min_fr, in the traditional Apriori. For Apriori, the search and post-processing times are given separately. Columns: data, n, k, min_cf, max_lnp, time (FishApriori), min_fr, time (traditional Apriori). The rows 1a-6b correspond to the Mushroom, Chess, T10I4100K, T40I10100K, Plants, and Garden data sets, each with the two confidence thresholds.

Table 5: Quality of the 100 best rules found by FishApriori and the traditional Apriori. The parameters are: average lift, z-score of the difference in lift, average prediction error, frequency, confidence, rule length (number of attributes in the antecedent), and, for FishApriori, the number of negative rules.


More information

NetBox: A Probabilistic Method for Analyzing Market Basket Data

NetBox: A Probabilistic Method for Analyzing Market Basket Data NetBox: A Probabilistic Method for Analyzing Market Basket Data José Miguel Hernández-Lobato joint work with Zoubin Gharhamani Department of Engineering, Cambridge University October 22, 2012 J. M. Hernández-Lobato

More information

From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu

From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu From statistics to data science BAE 815 (Fall 2017) Dr. Zifei Liu Zifeiliu@ksu.edu Why? How? What? How much? How many? Individual facts (quantities, characters, or symbols) The Data-Information-Knowledge-Wisdom

More information

Effective Elimination of Redundant Association Rules

Effective Elimination of Redundant Association Rules Effective Elimination of Redundant Association Rules James Cheng Yiping Ke Wilfred Ng Department of Computer Science and Engineering The Hong Kong University of Science and Technology Clear Water Bay,

More information

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit:

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit: Frequent Itemsets and Association Rule Mining Vinay Setty vinay.j.setty@uis.no Slides credit: http://www.mmds.org/ Association Rule Discovery Supermarket shelf management Market-basket model: Goal: Identify

More information

Maintaining Frequent Itemsets over High-Speed Data Streams

Maintaining Frequent Itemsets over High-Speed Data Streams Maintaining Frequent Itemsets over High-Speed Data Streams James Cheng, Yiping Ke, and Wilfred Ng Department of Computer Science Hong Kong University of Science and Technology Clear Water Bay, Kowloon,

More information

Free-sets : a Condensed Representation of Boolean Data for the Approximation of Frequency Queries

Free-sets : a Condensed Representation of Boolean Data for the Approximation of Frequency Queries Free-sets : a Condensed Representation of Boolean Data for the Approximation of Frequency Queries To appear in Data Mining and Knowledge Discovery, an International Journal c Kluwer Academic Publishers

More information

The Challenge of Mining Billions of Transactions

The Challenge of Mining Billions of Transactions Faculty of omputer Science The hallenge of Mining illions of Transactions Osmar R. Zaïane International Workshop on lgorithms for Large-Scale Information Processing in Knowledge iscovery Laboratory ata

More information

Chapters 6 & 7, Frequent Pattern Mining

Chapters 6 & 7, Frequent Pattern Mining CSI 4352, Introduction to Data Mining Chapters 6 & 7, Frequent Pattern Mining Young-Rae Cho Associate Professor Department of Computer Science Baylor University CSI 4352, Introduction to Data Mining Chapters

More information

Quantitative Association Rule Mining on Weighted Transactional Data

Quantitative Association Rule Mining on Weighted Transactional Data Quantitative Association Rule Mining on Weighted Transactional Data D. Sujatha and Naveen C. H. Abstract In this paper we have proposed an approach for mining quantitative association rules. The aim of

More information

A Concise Representation of Association Rules using Minimal Predictive Rules

A Concise Representation of Association Rules using Minimal Predictive Rules A Concise Representation of Association Rules using Minimal Predictive Rules Iyad Batal and Milos Hauskrecht Department of Computer Science University of Pittsburgh {iyad,milos}@cs.pitt.edu Abstract. Association

More information

CS-C Data Science Chapter 8: Discrete methods for analyzing large binary datasets

CS-C Data Science Chapter 8: Discrete methods for analyzing large binary datasets CS-C3160 - Data Science Chapter 8: Discrete methods for analyzing large binary datasets Jaakko Hollmén, Department of Computer Science 30.10.2017-18.12.2017 1 Rest of the course In the first part of the

More information

Mining Infrequent Patter ns

Mining Infrequent Patter ns Mining Infrequent Patter ns JOHAN BJARNLE (JOHBJ551) PETER ZHU (PETZH912) LINKÖPING UNIVERSITY, 2009 TNM033 DATA MINING Contents 1 Introduction... 2 2 Techniques... 3 2.1 Negative Patterns... 3 2.2 Negative

More information

Standardizing Interestingness Measures for Association Rules

Standardizing Interestingness Measures for Association Rules Standardizing Interestingness Measures for Association Rules arxiv:138.374v1 [stat.ap] 16 Aug 13 Mateen Shaikh, Paul D. McNicholas, M. Luiza Antonie and T. Brendan Murphy Department of Mathematics & Statistics,

More information

Decision Trees. Lewis Fishgold. (Material in these slides adapted from Ray Mooney's slides on Decision Trees)

Decision Trees. Lewis Fishgold. (Material in these slides adapted from Ray Mooney's slides on Decision Trees) Decision Trees Lewis Fishgold (Material in these slides adapted from Ray Mooney's slides on Decision Trees) Classification using Decision Trees Nodes test features, there is one branch for each value of

More information

Pattern Space Maintenance for Data Updates. and Interactive Mining

Pattern Space Maintenance for Data Updates. and Interactive Mining Pattern Space Maintenance for Data Updates and Interactive Mining Mengling Feng, 1,3,4 Guozhu Dong, 2 Jinyan Li, 1 Yap-Peng Tan, 1 Limsoon Wong 3 1 Nanyang Technological University, 2 Wright State University

More information

Chapter 2 Quality Measures in Pattern Mining

Chapter 2 Quality Measures in Pattern Mining Chapter 2 Quality Measures in Pattern Mining Abstract In this chapter different quality measures to evaluate the interest of the patterns discovered in the mining process are described. Patterns represent

More information

CS5112: Algorithms and Data Structures for Applications

CS5112: Algorithms and Data Structures for Applications CS5112: Algorithms and Data Structures for Applications Lecture 19: Association rules Ramin Zabih Some content from: Wikipedia/Google image search; Harrington; J. Leskovec, A. Rajaraman, J. Ullman: Mining

More information

Empirical Risk Minimization, Model Selection, and Model Assessment

Empirical Risk Minimization, Model Selection, and Model Assessment Empirical Risk Minimization, Model Selection, and Model Assessment CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 5.7-5.7.2.4, 6.5-6.5.3.1 Dietterich,

More information

Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar

Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar Data Mining Chapter 5 Association Analysis: Basic Concepts Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar 2/3/28 Introduction to Data Mining Association Rule Mining Given

More information

Outline. Fast Algorithms for Mining Association Rules. Applications of Data Mining. Data Mining. Association Rule. Discussion

Outline. Fast Algorithms for Mining Association Rules. Applications of Data Mining. Data Mining. Association Rule. Discussion Outline Fast Algorithms for Mining Association Rules Rakesh Agrawal Ramakrishnan Srikant Introduction Algorithm Apriori Algorithm AprioriTid Comparison of Algorithms Conclusion Presenter: Dan Li Discussion:

More information

Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional Time-Series Data

Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional Time-Series Data Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional -Series Data Xiaolei Li, Jiawei Han University of Illinois at Urbana-Champaign VLDB 2007 1 Series Data Many applications produce time series

More information

Connections between mining frequent itemsets and learning generative models

Connections between mining frequent itemsets and learning generative models Connections between mining frequent itemsets and learning generative models Srivatsan Laxman Microsoft Research Labs India slaxman@microsoft.com Prasad Naldurg Microsoft Research Labs India prasadn@microsoft.com

More information

CSC 411 Lecture 3: Decision Trees

CSC 411 Lecture 3: Decision Trees CSC 411 Lecture 3: Decision Trees Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 03-Decision Trees 1 / 33 Today Decision Trees Simple but powerful learning

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Machine Learning, Chapter 3 2. Data Mining: Concepts, Models,

More information

the tree till a class assignment is reached

the tree till a class assignment is reached Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal

More information

ASSOCIATION ANALYSIS FREQUENT ITEMSETS MINING. Alexandre Termier, LIG

ASSOCIATION ANALYSIS FREQUENT ITEMSETS MINING. Alexandre Termier, LIG ASSOCIATION ANALYSIS FREQUENT ITEMSETS MINING, LIG M2 SIF DMV course 207/208 Market basket analysis Analyse supermarket s transaction data Transaction = «market basket» of a customer Find which items are

More information

Distributed Mining of Frequent Closed Itemsets: Some Preliminary Results

Distributed Mining of Frequent Closed Itemsets: Some Preliminary Results Distributed Mining of Frequent Closed Itemsets: Some Preliminary Results Claudio Lucchese Ca Foscari University of Venice clucches@dsi.unive.it Raffaele Perego ISTI-CNR of Pisa perego@isti.cnr.it Salvatore

More information

Decision Trees. Machine Learning CSEP546 Carlos Guestrin University of Washington. February 3, 2014

Decision Trees. Machine Learning CSEP546 Carlos Guestrin University of Washington. February 3, 2014 Decision Trees Machine Learning CSEP546 Carlos Guestrin University of Washington February 3, 2014 17 Linear separability n A dataset is linearly separable iff there exists a separating hyperplane: Exists

More information

Association Rule Mining on Web

Association Rule Mining on Web Association Rule Mining on Web What Is Association Rule Mining? Association rule mining: Finding interesting relationships among items (or objects, events) in a given data set. Example: Basket data analysis

More information

CS246 Final Exam. March 16, :30AM - 11:30AM

CS246 Final Exam. March 16, :30AM - 11:30AM CS246 Final Exam March 16, 2016 8:30AM - 11:30AM Name : SUID : I acknowledge and accept the Stanford Honor Code. I have neither given nor received unpermitted help on this examination. (signed) Directions

More information

Association Rules Information Retrieval and Data Mining. Prof. Matteo Matteucci

Association Rules Information Retrieval and Data Mining. Prof. Matteo Matteucci Association Rules Information Retrieval and Data Mining Prof. Matteo Matteucci Learning Unsupervised Rules!?! 2 Market-Basket Transactions 3 Bread Peanuts Milk Fruit Jam Bread Jam Soda Chips Milk Fruit

More information

Chapter 3 Deterministic planning

Chapter 3 Deterministic planning Chapter 3 Deterministic planning In this chapter we describe a number of algorithms for solving the historically most important and most basic type of planning problem. Two rather strong simplifying assumptions

More information

Frequent Pattern Mining. Toon Calders University of Antwerp

Frequent Pattern Mining. Toon Calders University of Antwerp Frequent Pattern Mining Toon alders University of ntwerp Summary Frequent Itemset Mining lgorithms onstraint ased Mining ondensed Representations Frequent Itemset Mining Market-asket nalysis transaction

More information

arxiv: v1 [cs.db] 31 Dec 2011

arxiv: v1 [cs.db] 31 Dec 2011 Mining Flipping Correlations from Large Datasets with Taxonomies MarinaBarsky SangkyumKim TimWeninger JiaweiHan Univ. ofvictoria,bc,canada,marina barsky@gmail.com Univ.ofIllinoisatUrbana-Champaign, {kim71,weninger1,hanj}@illinois.edu

More information

Application of Apriori Algorithm in Open Experiment

Application of Apriori Algorithm in Open Experiment 2011 International Conference on Computer Science and Information Technology (ICCSIT 2011) IPCSIT vol. 51 (2012) (2012) IACSIT Press, Singapore DOI: 10.7763/IPCSIT.2012.V51.130 Application of Apriori Algorithm

More information

Inference of A Minimum Size Boolean Function by Using A New Efficient Branch-and-Bound Approach From Examples

Inference of A Minimum Size Boolean Function by Using A New Efficient Branch-and-Bound Approach From Examples Published in: Journal of Global Optimization, 5, pp. 69-9, 199. Inference of A Minimum Size Boolean Function by Using A New Efficient Branch-and-Bound Approach From Examples Evangelos Triantaphyllou Assistant

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

COMP 5331: Knowledge Discovery and Data Mining

COMP 5331: Knowledge Discovery and Data Mining COMP 5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Tan, Steinbach, Kumar And Jiawei Han, Micheline Kamber, and Jian Pei 1 10

More information