Efficient discovery of statistically significant association rules

Efficient discovery of statistically significant association rules

Wilhelmiina Hämäläinen, Department of Computer Science, University of Helsinki, Finland
Matti Nykänen, Department of Computer Science, University of Kuopio, Finland

Abstract

Searching for statistically significant association rules is an important but neglected problem. Traditional association rules do not capture the idea of statistical dependence, and the resulting rules can be spurious while the most significant rules may be missed. This leads to erroneous models and predictions, which often become expensive. The problem is computationally very difficult, because significance is not a monotonic property. However, in this paper we prove several other properties which can be used for pruning the search space. The properties are implemented in the StatApriori algorithm, which searches for statistically significant, non-redundant association rules. Based on both theoretical and empirical observations, the resulting rules are very accurate compared to traditional association rules. In addition, StatApriori can work with extremely low frequencies, thus finding new interesting rules.

1. Introduction

Traditional association rules [1] are rules of the form "if event X = x occurs, then event A = a is also likely to occur". The commonness of the rule is measured by its frequency P(X = x, A = a) and the strength of the rule by its confidence P(A = a | X = x). For computational purposes it is required that both the frequency and the confidence exceed some user-defined thresholds. The actual interestingness of the rule is usually decided afterwards, by some interestingness measure.

Often the associations are interpreted as correlations or dependencies between certain attribute-value combinations. However, traditional association rules do not necessarily capture statistical dependencies: they can associate absolutely independent events while ignoring strong dependencies. As a solution, it is often suggested (following the axioms by Piatetsky-Shapiro [18]) to measure the lift (interest) instead of the confidence (e.g. [21]). This produces statistically sounder results, but it is still possible to find spurious rules while missing statistically significant rules. These two error types are called type 1 and type 2 errors (in computer science terms, false positives and false negatives). In the worst case, all discovered rules can be spurious [23, 24]. In practice, this means that the future data does not exhibit the discovered dependencies, and conclusions based on them are erroneous. The results can be expensive or even fatal, as the following example demonstrates.

Example 1. A biological database contains observation reports from different kinds of biotopes, like grove, marsh, waterside, coniferous forest, etc. For association analysis, each report is represented as a binary vector listing the observed species along with biotope characteristics. Local forestry societies as well as individual land owners can use the data when they decide e.g. on fellings or protected sites. The forestry society FallAll is going to drain swamps for new forests. Before any decisions are made, they search for associations in the 1000 observations on marsh sides, using a minimum frequency of 0.05 and a minimum confidence threshold. One discovered rule is leather leaf ⇒ cloudberry, with frequency 0.06 and a confidence above the threshold. Since the cloudberry is a commercially important product, the forestry society decides to protect a marsh growing leather leaves when the other swamps are drained.
The decision is excellent for the leather leaf, but all cloudberries in the area disappear. The reason is that cloudberries require a wet swamp, while leather leaves can grow on both moist and wet sides. The only protected swamp in the area was too dry for cloudberries. This catastrophe was due to the spurious rule leather leaf ⇒ cloudberry. The rule has p-value 0.13, which means that there is a 13% probability of making a type 1 error. At the same time, the forestry society misses an important rule, namely wet swamp, leather leaf ⇒ cloudberry. This rule was not found, because its frequency was too low. However, it is a strong rule with confidence 1.0, and its p-value indicates that the rule is quite reliable: roughly speaking, there is only a 1.1% probability that the rule is spurious.
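To make the contrast concrete, the following small Python sketch (with purely hypothetical counts, not the survey data of the example) computes the frequency, confidence, and lift of a rule from absolute counts; it shows that a rule can reach high confidence even when its lift stays close to 1, i.e. when the events are nearly independent.

    # A minimal sketch with hypothetical counts: confidence alone cannot distinguish
    # a spurious rule from a real dependency, but lift compares the rule against the
    # independence assumption P(X)P(A).

    def rule_stats(n, m_x, m_a, m_xa):
        """Return (frequency, confidence, lift) of rule X -> A from absolute counts."""
        fr = m_xa / n                      # P(X, A)
        cf = m_xa / m_x                    # P(A | X)
        lift = (m_xa * n) / (m_x * m_a)    # P(X, A) / (P(X) P(A))
        return fr, cf, lift

    # Hypothetical counts: A is very common, so the rule looks strong even under independence.
    print(rule_stats(n=1000, m_x=70, m_a=900, m_xa=60))   # fairly high confidence, lift near 1
    print(rule_stats(n=1000, m_x=40, m_a=60, m_xa=40))    # lower frequency, but lift >> 1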

The problems of association rules, and especially of the frequency-confidence framework, are well known ([23, 24, 6, 16]), but there have still been only a few attempts to solve them. Quite likely the reason is purely practical: the problem has been considered computationally intractable. Statistical significance is not a monotonic property, and therefore it cannot be used for pruning the search space in the same manner as the frequency. However, when we search directly for statistically significant rules (instead of sets), we can utilize other properties for efficient pruning. More efficiency is achieved by searching only for minimal (non-redundant) statistically significant rules. Such rules are at least as good as the pruned rules, but simpler, and no information is lost. In practice, the simpler rules avoid overfitting and hold better in future data.

In this paper, we introduce a set of properties which can be used for searching for the minimal, statistically most significant association rules. The properties are implemented in the StatApriori algorithm. StatApriori guarantees that no significant rules are missed, while the number of spurious rule candidates generated during the execution is kept minimal. Compared to a modification of the classical Apriori algorithm which also produces all significant association rules, StatApriori is very efficient: it can tackle problems which are impossible to compute with the classical Apriori.

As far as we know, the algorithm is the first of its kind. Previous algorithms have been restricted to statistically significant classification rules using the χ²-measure (e.g. [15, 16, 4, 22, 17]). This is quite a different problem, because in classification both X ⇒ C and X ⇒ ¬C should be accurate. In other words, classification rules describe dependencies between attributes, while association rules describe dependencies between events. Webb [24] has done pioneering work in testing the statistical significance of association rules by Fisher's exact test. Fisher's exact test, like χ², is designed for measuring dependence between attributes, and some significant association rules can be missed (type 2 error). However, no new algorithms were introduced in these experiments, and the search proved to be infeasible on large data sets with the existing techniques. In addition, several interestingness measures (see e.g. [9] for an overview) have their origins in statistics, but they do not measure the statistical significance of association rules. As a result, they can cause both type 1 and type 2 errors.

The organization of the paper is the following: basic definitions are given in Section 2, the main principles of the search in Section 3, and the StatApriori algorithm in Section 4. Experimental results are reported in Section 5 and the final conclusions are drawn in Section 6.

2. Basic definitions

In the following we give the basic definitions of the association rule, statistical dependence, statistical significance, and redundancy. The notations are introduced in Table 1.

Table 1. Basic notations.

  Notation                                    Meaning
  A, B, C, ...                                binary attributes
  a, b, c, ... ∈ {0, 1}                       attribute values
  R = {A_1, ..., A_k}                         set of all attributes
  |R| = k                                     number of attributes
  Dom(R) = {0, 1}^k                           attribute space
  X, Y, Z ⊆ R                                 attribute sets
  Dom(X) = {0, 1}^l                           domain of X, |X| = l
  (X = x) = {(A_1 = a_1), ..., (A_l = a_l)}   event, |X| = l
  t = {A_1 = t(A_1), ..., A_k = t(A_k)}       row (tuple, transaction)
  r = {t_1, ..., t_n | t_i ∈ Dom(R)}          relation (data set)
  |r| = n                                     size of relation r
  σ_{X=x}(r) = {t ∈ r | t[X] = x}             set of rows where X = x
  m(X = x) = |σ_{X=x}(r)|                     number of rows where X = x
  P(X = x) = m(X = x)/n                       relative frequency of X = x
  i(fr, γ)                                    measure function
  I(X ⇒ A) = i(P(XA), γ(X, A))                measure value of rule X ⇒ A
  upperbound(f)                               an upper bound for function f
  bestrule(X) = arg max_{A ∈ X} I(X \ A ⇒ A)  the best rule which can be constructed from X
  PS(X)                                       property "potentially significant": whether significant rules can be derived from X or its supersets
  minattr(X) = arg min {P(A_i) | A_i ∈ X}     minimum attribute of X, i.e. the one with the lowest frequency
2.1. Association rules

Traditionally, association rules are defined in the frequency-confidence framework:

Definition 1 (Association rule). Let R be a set of binary attributes and r a relation according to R. Let X ⊆ R, A ∈ R \ X, x ∈ Dom(X), and a ∈ Dom(A). The confidence of rule (X = x) ⇒ (A = a) is

  cf(X = x ⇒ A = a) = P(X = x, A = a) / P(X = x) = P(A = a | X = x)

and the frequency (support) of the rule is

  fr(X = x ⇒ A = a) = P(X = x, A = a).

Given user-defined thresholds min_cf, min_fr ∈ [0, 1], rule (X = x) ⇒ (A = a) is an association rule in r, if

  (i) cf(X = x ⇒ A = a) ≥ min_cf, and
  (ii) fr(X = x ⇒ A = a) ≥ min_fr.
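As an illustration of Definition 1 and the notation of Table 1, the following sketch (assuming NumPy is available; this is not the authors' implementation) computes m(X), the frequency, and the confidence of a rule directly from a 0/1 data matrix with one column per attribute.

    import numpy as np

    def frequency_and_confidence(r, X, A):
        """fr(X -> A) = P(X=1, A=1) and cf(X -> A) = P(A=1 | X=1) from a 0/1 matrix r."""
        n = r.shape[0]
        rows_X = np.all(r[:, X] == 1, axis=1)          # sigma_{X=1}(r)
        rows_XA = rows_X & (r[:, A] == 1)
        m_X, m_XA = rows_X.sum(), rows_XA.sum()        # m(X=1) and m(X=1, A=1)
        return m_XA / n, m_XA / m_X

    # toy relation over attributes A1..A4 (columns 0..3)
    r = np.array([[1, 1, 0, 1],
                  [1, 1, 1, 1],
                  [0, 1, 1, 0],
                  [1, 0, 1, 1]])
    print(frequency_and_confidence(r, X=[0, 1], A=3))  # rule {A1, A2} -> A4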

The first condition requires that an association rule should be strong enough, and the second that it should be common enough. In this paper we call rules association rules even if no thresholds min_fr and min_cf are specified. Usually it is assumed that the rule contains only positive attribute values (A_i = 1). The rule can then be expressed simply by listing the attributes, e.g. A_1, A_3, A_5 ⇒ A_2.

2.2. Statistical dependence

Statistical dependence is usually defined through statistical independence (e.g. [20, 14]):

Definition 2 (Independence and dependence). Let X ⊆ R and A ⊆ R \ X be sets of binary attributes. Events X = x and A = a, x ∈ Dom(X), a ∈ Dom(A), are mutually independent, if P(X = x, A = a) = P(X = x)P(A = a). If the events are not independent, they are dependent.

The strength of the statistical dependence between (X = x) and (A = a) can be measured by the lift or interest:

  γ(X = x, A = a) = P(X = x, A = a) / (P(X = x)P(A = a)).

In the following, we concentrate on dependencies between events containing only positive attributes. The lift of rule X ⇒ A is denoted simply γ(X, A).

2.3. Statistical significance

In this work, we analyze the statistical significance of association rules in the classical (frequentist) framework. Bayesian significance testing offers an interesting alternative, but it is still little studied in this context. Both approaches produce asymptotically similar results (under some assumptions the test results are identical), although Bayesian testing is sensitive to the selected prior probabilities [3].

The idea of classical statistical significance tests (see e.g. [8, Ch. 26] or [12, Ch. 10.1]) is to estimate the probability of the observed or a rarer phenomenon under some null hypothesis. When the objective is to test the significance of the dependency between events X and A, the null hypothesis is the independence assumption P(X, A) = P(X)P(A). The task is to calculate the probability p that the observed or a higher frequency occurs in the data if the events were actually independent. If the estimated probability p is very small, then the observed dependency is said to be significant at level p.

The significance of the observed frequency m(X, A) can be estimated exactly by the binomial distribution. Each row in relation r, |r| = n, corresponds to an independent Bernoulli trial whose outcome is either 1 (XA occurs) or 0 (XA does not occur). All rows are mutually independent. Assuming the independence of attributes X and A, combination XA occurs on a row with probability P(X)P(A). Now the number of rows containing X, A is a binomial random variable M with parameters P(X)P(A) and n. The mean of M is μ_M = nP(X)P(A) and its variance is σ_M² = nP(X)P(A)(1 − P(X)P(A)). Probability P(M ≥ m(X, A)) gives the significance p:

  p = Σ_{i=m(X,A)}^{m(X)} C(n, i) (P(X)P(A))^i (1 − P(X)P(A))^{n−i}.

The significance can be used in two ways to prune association rules: either 1) we set the significance level (maximum p) and search for all rules with sufficiently low p, or 2) we use the p-values to search for the K most significant rules. Deciding the required significance level is a difficult problem, which we do not try to solve here. The problem is that the more rules we test, the more spurious rules are likely to pass the significance test. Webb [24] has suggested a solution in the context of association rule discovery, using the Bonferroni adjustment [19].
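The exact binomial tail defined above can be evaluated directly; the following sketch (assuming SciPy is available, with purely hypothetical counts) implements it.

    from scipy.stats import binom

    def binomial_p_value(n, p_x, p_a, m_x, m_xa):
        """p = sum_{i=m(X,A)}^{m(X)} C(n,i) (P(X)P(A))^i (1 - P(X)P(A))^(n-i)."""
        q = p_x * p_a                                  # success probability under independence
        return binom.cdf(m_x, n, q) - binom.cdf(m_xa - 1, n, q)

    # hypothetical figures: n = 1000 rows, P(X) = 0.07, P(A) = 0.5, m(X) = 70, m(X,A) = 60
    print(binomial_p_value(1000, 0.07, 0.5, 70, 60))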
2.4. The z-score

The binomial probability is quite difficult to calculate, but for our purposes it is enough to have an upper bound for the p-value. This guarantees that no rules with a low p-value are lost when the search space is pruned. Additional pruning and ranking can be done afterwards, when the actual binomial probabilities are calculated. The simplest upper bound is based on the (binomial) z-score:

  z(X, A) = (m(X, A) − μ_M) / σ_M
          = (m(X, A) − nP(X)P(A)) / √(nP(X)P(A)(1 − P(X)P(A)))
          = √n (γ(X, A) − 1) √P(X, A) / √(γ(X, A) − P(X, A)).

The z-score measures how many standard deviations (σ_M) the observed frequency m(X, A) deviates from the expected value μ_M = nP(X)P(A). The corresponding probability can be easily approximated, because z follows the standard normal distribution when n is sufficiently large and P(X)P(A) (or 1 − P(X)P(A)) is neither close to 0 nor to 1. As a rule of thumb, the approximation can be used when nP(X)P(A) ≥ 5 (e.g. [12, p. 147]).
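The two forms of the z-score above are easy to check numerically; the following sketch (with hypothetical frequencies) computes both, and they coincide up to rounding.

    import math

    def z_score(n, p_x, p_a, p_xa):
        """z-score of rule X -> A from the observed frequencies P(X), P(A), P(X,A)."""
        m_xa, mu = n * p_xa, n * p_x * p_a
        sigma = math.sqrt(n * p_x * p_a * (1 - p_x * p_a))
        z1 = (m_xa - mu) / sigma                       # definition form
        gamma = p_xa / (p_x * p_a)                     # lift
        z2 = math.sqrt(n) * (gamma - 1) * math.sqrt(p_xa) / math.sqrt(gamma - p_xa)
        return z1, z2                                  # the two forms coincide

    print(z_score(n=1000, p_x=0.07, p_a=0.5, p_xa=0.06))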

According to [7], the approximation works well even for nP(X)P(A) ≥ 2 if a continuity correction (subtracting 0.5 from m(X, A)) is used. When P(X)P(A) is low, the binomial distribution is positively skewed, which means that the z-score overestimates the significance. Therefore we do not use the normal approximation to estimate the p-values; the z-score is used only as a measure function.

We note that the z-score is not crucial to our method: several other measure functions can be used as well. The requirement is that the measure I is a monotonically increasing or decreasing function of m(X, A) and γ(X, A). For example, when the expected value P(X)P(A) is very low, we can derive a tight upper bound for p from the Chernoff bound [10]:

  P(M > (1 + δ)μ_M) < e^{δμ_M} / (1 + δ)^{(1+δ)μ_M}.

By inserting δ = γ − 1, where γ = γ(X, A), and using γμ_M = m(X, A), we obtain

  p_ch = P(M > m(X, A)) < (e^{(γ−1)/γ} / γ)^{m(X,A)},

which is monotonically decreasing in both m(X, A) and γ.

2.5. Redundancy

A common goal in association rule discovery is to find the minimal (or most general) interesting rules and prune out redundant rules [5]. The reasons are twofold. First, the number of discovered rules is typically too large (even hundreds of thousands of rules) for any human interpreter. According to the Occam's Razor principle, it is only sensible to prune out a complex rule X ⇒ A if its generalizations Z ⇒ A, Z ⊊ X, are at least equally interesting. The user just has to define the interestingness measure carefully, according to the modelling purposes. Second, pruning redundant rules can save search time enormously if it is done on-line. This is not possible with many interestingness functions, and usually the pruning is done afterwards. In our case the interestingness measure is the statistical significance, but in general, redundancy and minimality can be defined with respect to any other measure function.

Definition 3 (Redundant rules). Given an increasing interestingness measure I, rule X ⇒ A is redundant, if there exists a rule X' ⇒ A' such that X' ∪ {A'} ⊊ X ∪ {A} and I(X' ⇒ A') ≥ I(X ⇒ A). If the rule is not redundant, then it is called non-redundant.

I.e. a rule is non-redundant if all its generalizations ("parent rules") are less significant. It is still possible that some or all of its specializations ("children rules") are better; in the latter case the rule is unlikely to be interesting itself. Non-redundant rules can be further classified as minimal or non-minimal:

Definition 4 (Minimal rules). Non-redundant rule X ⇒ A is minimal, if for all rules X' ⇒ A' such that X ∪ {A} ⊊ X' ∪ {A'}, I(X ⇒ A) ≥ I(X' ⇒ A').

I.e. a minimal rule is more significant than any of its parent or children rules. At the algorithmic level this means that we can stop the search without checking any children rules once we have ensured that a rule is minimal.

3. Main principles

In this section we introduce the main principles of the search algorithm. The results are given on such a general level that any suitable measure function or search strategy can be applied.

3.1. Problem definition

Let us first define the problem formally:

Definition 5 (Search problem). Let p(X ⇒ A) denote the p-value of rule X ⇒ A. Given binary data r and threshold p_max ∈ R, the problem is to search for a set of association rules S such that for all X ⇒ A ∈ S

  1. X ⇒ A expresses a positive correlation, i.e. γ(X, A) > 1,
  2. X ⇒ A is non-redundant,
  3. for all Y ⇒ B ∉ S, p(X ⇒ A) ≤ p(Y ⇒ B), and
  4. p(X ⇒ A) ≤ p_max.

We note that the user has to select only one parameter, p_max.
Alternatively, we could define an optimization problem where the N best rules (those with the lowest p-values) are searched for. Let us now assume that we have a measure function i(fr, γ) which defines an upper bound for the binomial probability. For any rule X ⇒ A, I(X ⇒ A) = i(P(XA), γ(X, A)). In addition, let i be either monotonically increasing or decreasing in both the frequency fr and the lift γ. The search problem can be divided into two subproblems:

  1. Search for all non-redundant rules X ⇒ A for which p(X ⇒ A) ≤ p_max, using i.
  2. Calculate the exact p-values and output the rules with sufficiently low p.

The postprocessing step is trivial, and we concentrate only on the search step. For simplicity, we assume that i is monotonically increasing; a monotonically decreasing measure function is handled similarly. The measure function I guarantees that all significant rules at level p_max are discovered. For efficiency, i should also prune out as many insignificant or redundant rules as possible.
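A minimal sketch of the postprocessing step (subproblem 2), assuming the candidate rules arrive from the search phase as tuples of row counts and marginal frequencies, could look as follows; both the candidate format and the figures are assumptions for illustration only.

    from scipy.stats import binom

    def exact_p(n, p_x, p_a, m_x, m_xa):
        """Exact binomial p-value of a candidate rule under the independence null hypothesis."""
        q = p_x * p_a
        return binom.cdf(m_x, n, q) - binom.cdf(m_xa - 1, n, q)

    def postprocess(candidates, p_max):
        """Keep candidates with p <= p_max and rank them by increasing p-value."""
        scored = [(exact_p(*c), c) for c in candidates]
        return sorted((p, c) for p, c in scored if p <= p_max)

    print(postprocess([(1000, 0.07, 0.5, 70, 60), (1000, 0.04, 0.06, 40, 40)], p_max=0.01))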

Figure 1. The general Apriori algorithm.

  Input: set of attributes R, data set r, an anti-monotonic property π
  Output: {X ∈ P(R) | π(X) = 1}
  Method:
    // Initialization
    S_1 = {A_i ∈ R | π(A_i) = 1}
    l = 1
    while (S_l ≠ ∅)
      // Step 1: Candidate generation
      generate C_{l+1} from S_l
      // Step 2: Pruning
      S_{l+1} = {c ∈ C_{l+1} | π(c) = 1}
      l = l + 1
    return ∪_l S_l

3.2. Monotonic and anti-monotonic properties

The key idea of the classical Apriori algorithm [2, 13] is the anti-monotonicity of frequency. For attribute sets, monotonicity and anti-monotonicity are defined as follows:

Definition 6 (Monotonic and anti-monotonic properties). Property π : P(R) → {0, 1} is monotonic, if (π(Y) = 1) ⇒ (π(X) = 1) for all X ⊇ Y, and anti-monotonic, if (π(X) = 1) ⇒ (π(Y) = 1) for all Y ⊆ X. When π is anti-monotonic, (π(Y) = 0) ⇒ (π(X) = 0) for all X ⊇ Y.

When the measure function defines an anti-monotonic property, the interesting sets or rules can be searched for with the general Apriori algorithm (Figure 1). The problem is that the measure functions for statistical significance do not define any anti-monotonic property. However, it turns out that the upper bound for the measure function I defines an anti-monotonic property for most set-inclusion relations.

3.3. Property PS

Let us define the property PS, "potentially significant". Potential significance of set X is a necessary condition for constructing any significant rule X \ A ⇒ A.

Definition 7. Let measure function I be as before, min_I a user-defined threshold, and upperbound(f) an upper bound for function f. Let bestrule(X) = arg max_{A ∈ X} {I(X \ A ⇒ A)} be the best rule which can be constructed from attributes X. Property PS : P(R) → {0, 1} is defined as PS(X) = 1, iff upperbound(I(bestrule(X))) ≥ min_I.

Now it is enough to define the conditions under which PS behaves anti-monotonically. The following theorem is the core of the whole search algorithm:

Theorem 1. Let PS, X, and Y be as before. If PS(X) = 1, then PS(Y) = 1 for all Y ⊆ X such that minattr(X) = minattr(Y).

Proof. First observe that for all A ∈ X we have γ(X \ A, A) ≤ 1/P(A) ≤ 1/P(minattr(X)) and upperbound(I(X \ A ⇒ A)) = i(P(X), 1/P(minattr(X))). Hence

  upperbound(I(bestrule(X))) = i(P(X), 1/P(minattr(X))) ≤ i(P(Y), 1/P(minattr(Y))) = upperbound(I(bestrule(Y)))

for all Y ⊆ X such that minattr(X) = minattr(Y) (since Y ⊆ X implies P(X) ≤ P(Y), and i is increasing in its first argument). We have min_I ≤ upperbound(I(bestrule(X))) by the definition of PS(X) = 1. Hence the reasoning above also yields min_I ≤ upperbound(I(bestrule(Y))), as required for the definition of PS(Y) = 1.

Corollary 1. If PS(Y) = 0, then PS(X) = 0 for all X ⊇ Y such that minattr(X) = minattr(Y).

We have shown that property PS defines an anti-monotonic property among sets having the same minimum attribute.
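As an illustration of Theorem 1, the following sketch evaluates the PS upper bound with the z-score as the measure function i; the frequencies P(X) and P(minattr(X)) are assumed to be given (e.g. counted during the search), and min_I plays the role of the threshold. This is an illustrative sketch, not the authors' implementation.

    import math

    def z_measure(n, fr, gamma):
        """Measure function i(fr, gamma): the z-score expressed via frequency and lift."""
        return math.sqrt(n) * (gamma - 1) * math.sqrt(fr) / math.sqrt(gamma - fr)

    def ps(n, p_X, p_minattr, min_I):
        """Potential significance of set X: for every rule derivable from X,
        gamma <= 1/P(minattr(X)) and P(XA) <= P(X), so i(P(X), 1/P(minattr(X)))
        bounds I(bestrule(X)) from above."""
        gamma_max = 1.0 / p_minattr
        return z_measure(n, p_X, gamma_max) >= min_I

    print(ps(n=1000, p_X=0.05, p_minattr=0.06, min_I=3.0))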
If P (minattr(y l )) > P (minattr(x)), Y l has a lower upperbound for γ than Y 1,..., Y l 1 and X have. Therefore, it is possible that Y l is non-p S, even if X is P S. 4. Algorithm Next, we give the StatApriori algorithm, which implements the pruning properties. 5

4.1. The main idea

The StatApriori algorithm proceeds like the general Apriori (Figure 1), using property PS. It alternates between the candidate generation and pruning steps as long as new non-redundant, potentially significant rules can be found. However, special techniques are needed, because property PS is not anti-monotonic in all respects.

First, the attributes are arranged into ascending order by their frequencies. Let the renamed attributes be {A_1, ..., A_k}, where P(A_1) ≤ ... ≤ P(A_k). The idea is that the candidates are generated in this canonical order. From l-set X = {A_1, ..., A_l} we can generate (l + 1)-sets X ∪ {A_j}, where j > l. Now all supersets of X have the same upper bound for the lift, γ ≤ 1/P(A_1). If X is non-PS, then none of its descendants can be PS. Otherwise, we should check the other parent sets Z ⊊ X ∪ {A_j}, |Z| = l. If at most one of them, (X \ {A_1}) ∪ {A_j}, is non-PS, then X ∪ {A_j} is added to the candidate collection C_{l+1}. If (X \ {A_1}) ∪ {A_j} was non-PS, it is also added to a temporary collection of special parents for frequency counting.

After candidate generation, the exact frequencies are counted from the data. Candidates which are non-PS or can produce only redundant descendants are pruned, and the others are added to collection S_{l+1}. The minimality of PS rules is also checked, because no new candidates are generated from minimal rules. The principles for redundancy and minimality checking are:

  1. If γ(bestrule(X)) = 1/P(minattr(X)), then the lift is already the maximal possible, and none of X's specializations can gain a better p-value. The rule is marked as minimal.
  2. If upperbound(I(bestrule(X ∪ {A_j}))) ≤ I(bestrule(Z)) for some attribute set Z ⊊ X ∪ {A_j}, then X ∪ {A_j} and all its specializations will be redundant with respect to Z, and X ∪ {A_j} is removed.

4.2. Enumeration tree

The secret of StatApriori is a special kind of enumeration tree, which enables an efficient implementation of the pruning principles. A complete enumeration tree lists all sets in the powerset P(R). In practice, it can be implemented as a trie, where each root-to-node path corresponds to an item set. StatApriori uses an ordered enumeration tree, where the attributes are arranged into ascending order by their frequencies.

Figure 2 shows an example, when R = {A, B, C, D, E} and P(A) ≤ P(B) ≤ ... ≤ P(E). Solid lines represent an example tree when two levels have been generated, and dashed lines show the nodes missing from the complete tree. Let us now consider the candidate generation at the third level. The missing nodes at the second level are either insignificant, or they and all their descendants are redundant. Set {A, B, C} is added to the tree, because all its parent sets {A, B}, {A, C}, and {B, C} are in the tree (i.e. PS) and non-minimal. Sets {A, B, D} and {A, B, E} are not added, because they have non-PS parents (the missing sets {A, D} and {A, E}) with the same minimum attribute A. The same holds for {A, C, D} and {A, C, E}. However, {B, C, D} is added to the tree, because the only missing parent, {C, D}, has a different minimum attribute. This is the special case, and {C, D} is also added to a temporary collection for frequency counting. Sets {B, C, E} and {B, D, E} are not added, because {B, E} is missing. After frequency counting, non-PS candidates are removed from the tree.

Figure 2. A complete enumeration tree (dashed lines) and an example tree (solid lines).
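The candidate generation just described can be sketched as follows (a simplified illustration, not the authors' implementation: attributes are integers renamed in ascending order of frequency, sets are kept as sorted tuples, and the minimality checks are omitted).

    def gen_candidates(S_l):
        """S_l: set of PS l-sets (sorted tuples). Returns (candidates, special_parents)."""
        candidates, special = set(), set()
        for X in S_l:
            for Y in S_l:
                if X[:-1] == Y[:-1] and X[-1] < Y[-1]:      # same prefix, hence same minattr
                    cand = X + (Y[-1],)
                    parents = [cand[:i] + cand[i + 1:] for i in range(len(cand))]
                    # parents[0] drops the minimum attribute: the special case
                    if all(p in S_l for p in parents[1:]):
                        candidates.add(cand)
                        if parents[0] not in S_l:
                            special.add(parents[0])          # kept only for frequency counting
        return candidates, special

    S_2 = {(0, 1), (0, 2), (1, 2), (1, 3)}                  # PS 2-sets over attributes 0..4
    print(gen_candidates(S_2))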
4.3. Pseudocode

The StatApriori algorithm is given in Figures 3, 4, and 5. In the pseudocode it is assumed that the measure function I for statistical significance is increasing.

4.4. Time complexity

It is known that the problem of searching for all frequent attribute sets is NP-hard in terms of the number of attributes k [11]. The worst case happens when the most significant association rule involves all k attributes and all 2^k attribute sets are generated. The worst-case complexity of the algorithm is O(max{k², nk}·2^k). Usually k < n, and the bound reduces to O(nk·2^k).

Theorem 2. The worst-case time complexity of StatApriori is O(max{k², nk}·2^k), where n is the number of rows and k is the number of attributes.

Figure 3. Algorithm StatApriori.

  Input: set of attributes R, data set r, increasing measure function I, threshold K
  Output: non-redundant rules which are significant
  Method:
    // Initialization
    S_1 = {A_i ∈ R | PS(A_i) = 1}
    l = 1
    while (S_l ≠ ∅)
      // Step 1: Candidate generation
      C_{l+1} = GenCands(S_l, l)
      // Step 2: Pruning
      S_{l+1} = PruneCands(C_{l+1}, l + 1, K)
      l = l + 1
    for all X_i ∈ ∪_l S_l such that (¬X_i.redundant) and (X_i.max_I ≥ K)
      output bestrule(X_i)

Figure 4. Algorithm GenCands.

  Input: potentially significant l-sets S_l, size l
  Output: (l + 1)-candidates C_{l+1}, special parents SpecPar
  Method:
    C_{l+1} = ∅; SpecPar = ∅
    // X_i and X_j have l − 1 common attributes, the same
    // minimum attribute, and neither of them is minimal
    for all X_i, X_j ∈ S_l such that ((|X_i ∩ X_j| == l − 1) and ¬Minimal(X_i) and ¬Minimal(X_j))
      // check the other parents with the same minattr
      if for all Z ⊂ X_i ∪ X_j such that ((|Z| = l) and (minattr(Z) == minattr(X_i))): (¬Minimal(Z) and Z ∈ S_l)
        add X_i ∪ X_j to C_{l+1}
        // check the special case
        if ((X_i ∪ X_j) \ minattr(X_i) ∉ S_l)
          add (X_i ∪ X_j) \ minattr(X_i) to SpecPar
    return (C_{l+1}, SpecPar)

Figure 5. Algorithm PruneCands.

  Input: l-candidates C_l, size l, threshold K
  Output: potentially significant l-sets S_l
  Method:
    S_l = ∅
    for all X_i ∈ C_l, Y_j ∈ SpecPar
      count frequencies P(X_i) and P(Y_j) from r
    for all X_i ∈ C_l
      max_γ = 1/P(minattr(X_i))
      // check whether X_i is PS and its descendants can
      // produce non-redundant rules
      if (PS(P(X_i), max_γ, K) and ¬Redundant(P(X_i), max_γ))
        add X_i to S_l
        if (I(bestrule(X_i)) == upperbound(I(bestrule(X_i))))
          X_i.minimal = 1
        // check whether X_i is redundant; its descendants can
        // still be non-redundant
        if (Redundant(P(X_i), γ(bestrule(X_i))))
          X_i.redundant = 1
        X_i.max_I = max{Y_j.max_I | Y_j ∈ Parents(X_i)}
    return S_l

Proof. The initialization (generation of the 1-sets) takes nk steps. Producing the l-sets and their best rules takes l²|C_l| + 2n|C_l| log k + 2|S_l| l time steps. The first term is the time complexity of the candidate generation: each candidate has l parents, and each parent can be found in the trie in l − 1 steps. The second term is the complexity of the frequency counting: the database is read (n rows), and on each row |C_l| candidates are checked; in the worst case, all candidates have an extra parent which has to be checked, too, and each check takes at most log k steps when the data is stored as bit vectors and the inclusion test is implemented with logical bit operations. The third term is the complexity of the rule selection phase: for each of the |S_l| sets, all l parents are checked. Checking is done at most twice, once for calculating the maximal I-value (selecting the best rule) and a second time for checking the redundancy, and each check can be implemented in constant time if the parent pointers are stored in a temporary structure in the candidate generation phase. Since |S_l| ≤ |C_l|, the total complexity is

  Σ_{l=2}^{k} max{l², n log k}·|C_l| < max{k², n log k}·Σ_{l=2}^{k} C(k, l) = O(max{k², nk}·2^k).

5. Experiments

The main goal of the experiments was to evaluate the speed-accuracy ratio of the StatApriori algorithm. Even a clever algorithm is worthless unless it can produce better results or perform faster than existing methods. It was expected that StatApriori could not compete in speed with the traditional methods, but that it would produce more accurate rules.

The data sets and test parameters are described in Table 2. The first four data sets are classical benchmark data sets from the FIMI repository for frequent item set mining (http://fimi.cs.helsinki.fi/). Plants lists all plant species growing in the U.S.A. and Canada; each row contains a species and its distribution (the states where it grows). The data has been extracted from the USDA plants database. Garden lists recommended plant combinations; the data is extracted from several gardening sources (e.g. baygardens.tripod.com/). Each data set was tested with two minimum confidence thresholds, the higher one being 0.90. The goal was to find both strong (and probably accurate) rules and strong correlations. For all tested measures we calculated the average prediction accuracy (error in the test set) and lift among the 100 best rules over 10 executions. All experiments were executed on an Intel Core Duo processor with 1 GB RAM, a 2 MB cache, and the Linux operating system.

The quality of the rules is summarized in Table 3. In StatApriori, the main measure function was the z-score, but the binomial probabilities (p-values) were used for redundancy reduction, too. A rule was considered redundant if it had either a lower z-score than its parent rules, or if the lower bound of its log(p) was higher than the minimum upper bound of its parents' log(p). This strategy proved to be efficient when the frequencies become low and the z-scores inflate. For comparison, rules were also selected with the χ²-measure, J-measure, z-score, and frequency, after normal frequency-based pruning.

StatApriori produced very accurate results on all data sets except Chess and Garden. The latter was difficult for all measures, because all patterns are very rare; for a proper analysis, an ontology of genus, species, subspecies, and variety should be used. The poor behaviour on Chess is harder to explain. For all other measures the rules were selected with an exceptionally high minimum frequency (min_fr = 0.75). This means that the consequent holds on at least 75% of the rows and the error is less than 25% even if the antecedent is empty. In fact, the rules did not represent any correlations: the consequents were totally independent of the antecedents.

In all cases, StatApriori produced the strongest lift. This is understandable, because statistical significance measures the correlations. When the z-score was used with the minimum frequency thresholds, the lift values were much smaller. The accuracy was also poorer, which suggests that the z-score suffers from frequency-based pruning; quite likely the same holds for χ². It is noteworthy that StatApriori performed faster than the traditional Apriori in all test cases, even though no minimum frequency thresholds were used. The maximum execution time, 110 s, was for Chess. The large minimum frequencies for Apriori are partly due to heavy postprocessing: for feasibility, the thresholds were set to avoid an excessive number of rules. However, the dense data sets are difficult for Apriori even without this restriction.
For example, Apriori cannot handle Chess if the minimum frequency is lowered much further.

6. Conclusions

Searching for statistically significant association rules is an important but neglected problem. So far, it has been considered computationally infeasible for any larger data sets. In this paper we have shown that it is possible to search for all statistically significant rules in a reasonable time. We have introduced a set of effective pruning properties and a breadth-first search strategy, StatApriori, which implements them.

StatApriori can be used in two ways: either to search for the K most significant association rules, or for all rules passing a given significance threshold (minimum z-score). This enables the user to solve the multiple testing problem (i.e. setting the significance threshold) in the desired way, or to use the algorithm only for ranking the most significant rules. At the same time, StatApriori solves another important problem and prunes out all redundant association rules. According to the experimental results, this improves rule quality by avoiding overfitting. Together, the z-score and redundancy reduction provide a robust method for rule discovery, i.e. the discovered rules have a high probability of holding in future data.

As far as we know, this is the first algorithm of its kind. The few existing algorithms have either searched only for classification rules with statistical measures, or used frequency-based pruning to some extent. Both of these strategies are likely to lose significant association rules. In future research we are going to improve the efficiency further, using a suitable indexing structure or additional pruning criteria. The final goal is to develop an efficient algorithm for searching for the most significant general association rules, containing propositional logic formulas.

7. Acknowledgments

We thank the Finnish Concordia Fund (konkordia-liitto.com/) for supporting this research.

9 search. References [1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In P. Buneman and S. Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages , Washington, D.C., [2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB 94, pages Morgan Kaufmann, [3] A. Agresti and Y. Min. Frequentist performance of bayesian confidence intervals for comparing proportions in 2 2 contingency tables. Biometrics, 61:515523, [4]. Baralis and P. Garza. A lazy approach to pruning classification rules. In Proceedings of the 2002 I International Conference on Data Mining (ICDM 02), page 35. I Computer Society, [5] Y. Bastide, N. Pasquier, R. Taouil, G. Stumme, and L. Lakhal. Mining minimal non-redundant association rules using frequent closed itemsets. In Proceedings of the First International Conference on Computational Logic (CL 00), volume 1861 of Lecture Notes in Computer Science, pages Springer-Verlag, [6] F. Berzal, I. Blanco, D. Sánchez, and M. A. V. Miranda. A new framework to assess association rules. In Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis (IDA 01), volume 2189 of Lecture Notes In Computer Science, pages , London, UK, Springer-Verlag. [7] K. Carriere. How good is a normal approximation for rates and proportions of low incidence events? Communications in Statistics: Simulation and Computation, 30: , [8] D. Freedman, R. Pisani, and R. Purves. Statistics. Norton & Company, London, 4th edition, [9] L. Geng and H. J. Hamilton. Interestingness measures for data mining: A survey. ACM Computing Surveys, 38(3):9, [10] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:1330, [11] C. Jermaine. Finding the most interesting correlations in a database: how hard can it be? Information Systems, 30(1):21 46, [12] B. Lindgren. Statistical Theory. Chapman & Hall, Boca Raton, U.S.A., 4th edition, [13] H. Mannila, H. Toivonen, and A. Verkamo. fficient algorithms for discovering association rules. In Papers from the AAAI Workshop on Knowledge Discovery in Databases (KDD 94), pages AAAI Press, [14] R. Meo. Theory of dependence values. ACM Transactions on Database Systems, 25(3): , [15] S. Morishita and A. Nakaya. Parallel branch-and-bound graph search for correlated association rules. In Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD, volume 1759 of Lecture Notes in Computer Science, pages Springer-Verlag, [16] S. Morishita and J. Sese. Transversing itemset lattices with statistical metric pruning. In Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS 00), pages ACM Press, [17] S. Nijssen and J. Kok. Multi-class correlated pattern mining. In Proceedings of the 4th International Workshop on Knowledge Discovery in Inductive Databases, volume 3933 of Lecture Notes in Computer Science. Springer-Verlag, [18] G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro and W. Frawley, editors, Knowledge Discovery in Databases, pages AAAI/MIT Press, [19] J. Shaffer. Multiple hypothesis testing. Annual Review of Psychology, 46: , [20] C. Silverstein, S. Brin, and R. Motwani. 
Beyond market baskets: Generalizing association rules to dependence rules. Data Mining and Knowledge Discovery, 2(1):39 68, [21] P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the right objective measure for association analysis. Information Systems, 29(4): , [22] H. Toivonen, P. Onkamo, K. Vasko, V. Ollikainen, P. Sevon, H. Mannila, M. Herr, and J. Kere. Data mining applied to linkage disequilibrium mapping. American Journal of Human Genetics, 67: , [23] G. Webb. Discovering significant rules. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD 06), pages , New York, USA, ACM Press. [24] G. I. Webb. Discovering significant patterns. Machine Learning, 68(1):1 33,

Table 2. Description of data sets and test parameters. (Columns: data set, n, k, and min_cf, with min_Z for StatApriori and min_fr for Apriori. Test cases: 1a/1b Mushroom, 2a/2b Chess, 3a/3b T10I4D100K, 4a/4b T40I10D100K, 5a/5b Plants, 6a/6b Garden.)

Table 3. Average rule accuracy and lift with different measure functions. (Columns: lift γ and test-set error err for StatApriori (z-score and p-value) and, after frequency-based pruning, for the χ²-measure, J-measure, z-score, and frequency, over the same test cases 1a-6b.)


More information

Association Analysis Part 2. FP Growth (Pei et al 2000)

Association Analysis Part 2. FP Growth (Pei et al 2000) Association Analysis art 2 Sanjay Ranka rofessor Computer and Information Science and Engineering University of Florida F Growth ei et al 2 Use a compressed representation of the database using an F-tree

More information

Maintaining Frequent Itemsets over High-Speed Data Streams

Maintaining Frequent Itemsets over High-Speed Data Streams Maintaining Frequent Itemsets over High-Speed Data Streams James Cheng, Yiping Ke, and Wilfred Ng Department of Computer Science Hong Kong University of Science and Technology Clear Water Bay, Kowloon,

More information

Algorithmic Methods of Data Mining, Fall 2005, Course overview 1. Course overview

Algorithmic Methods of Data Mining, Fall 2005, Course overview 1. Course overview Algorithmic Methods of Data Mining, Fall 2005, Course overview 1 Course overview lgorithmic Methods of Data Mining, Fall 2005, Course overview 1 T-61.5060 Algorithmic methods of data mining (3 cp) P T-61.5060

More information

Apriori algorithm. Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK. Presentation Lauri Lahti

Apriori algorithm. Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK. Presentation Lauri Lahti Apriori algorithm Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK Presentation 12.3.2008 Lauri Lahti Association rules Techniques for data mining and knowledge discovery in databases

More information

Frequent Itemset Mining

Frequent Itemset Mining ì 1 Frequent Itemset Mining Nadjib LAZAAR LIRMM- UM COCONUT Team (PART I) IMAGINA 17/18 Webpage: http://www.lirmm.fr/~lazaar/teaching.html Email: lazaar@lirmm.fr 2 Data Mining ì Data Mining (DM) or Knowledge

More information

Transaction Databases, Frequent Itemsets, and Their Condensed Representations

Transaction Databases, Frequent Itemsets, and Their Condensed Representations Transaction Databases, Frequent Itemsets, and Their Condensed Representations Taneli Mielikäinen HIIT Basic Research Unit Department of Computer Science University of Helsinki, Finland Abstract. Mining

More information

Quantitative Association Rule Mining on Weighted Transactional Data

Quantitative Association Rule Mining on Weighted Transactional Data Quantitative Association Rule Mining on Weighted Transactional Data D. Sujatha and Naveen C. H. Abstract In this paper we have proposed an approach for mining quantitative association rules. The aim of

More information

Mining Molecular Fragments: Finding Relevant Substructures of Molecules

Mining Molecular Fragments: Finding Relevant Substructures of Molecules Mining Molecular Fragments: Finding Relevant Substructures of Molecules Christian Borgelt, Michael R. Berthold Proc. IEEE International Conference on Data Mining, 2002. ICDM 2002. Lecturers: Carlo Cagli

More information

DATA MINING LECTURE 4. Frequent Itemsets, Association Rules Evaluation Alternative Algorithms

DATA MINING LECTURE 4. Frequent Itemsets, Association Rules Evaluation Alternative Algorithms DATA MINING LECTURE 4 Frequent Itemsets, Association Rules Evaluation Alternative Algorithms RECAP Mining Frequent Itemsets Itemset A collection of one or more items Example: {Milk, Bread, Diaper} k-itemset

More information

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Classification without Models Well, partially without a model } Today: Decision Trees 2015 Bruno Ribeiro 2 3 Why Trees? } interpretable/intuitive,

More information

Decision Trees. CSC411/2515: Machine Learning and Data Mining, Winter 2018 Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore

Decision Trees. CSC411/2515: Machine Learning and Data Mining, Winter 2018 Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore Decision Trees Claude Monet, The Mulberry Tree Slides from Pedro Domingos, CSC411/2515: Machine Learning and Data Mining, Winter 2018 Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore Michael Guerzhoy

More information

Connections between mining frequent itemsets and learning generative models

Connections between mining frequent itemsets and learning generative models Connections between mining frequent itemsets and learning generative models Srivatsan Laxman Microsoft Research Labs India slaxman@microsoft.com Prasad Naldurg Microsoft Research Labs India prasadn@microsoft.com

More information

Naive Bayesian classifiers for multinomial features: a theoretical analysis

Naive Bayesian classifiers for multinomial features: a theoretical analysis Naive Bayesian classifiers for multinomial features: a theoretical analysis Ewald van Dyk 1, Etienne Barnard 2 1,2 School of Electrical, Electronic and Computer Engineering, University of North-West, South

More information

Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar

Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar Data Mining Chapter 5 Association Analysis: Basic Concepts Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar 2/3/28 Introduction to Data Mining Association Rule Mining Given

More information

Layered critical values: a powerful direct-adjustment approach to discovering significant patterns

Layered critical values: a powerful direct-adjustment approach to discovering significant patterns Mach Learn (2008) 71: 307 323 DOI 10.1007/s10994-008-5046-x TECHNICAL NOTE Layered critical values: a powerful direct-adjustment approach to discovering significant patterns Geoffrey I. Webb Received:

More information

Regression and Correlation Analysis of Different Interestingness Measures for Mining Association Rules

Regression and Correlation Analysis of Different Interestingness Measures for Mining Association Rules International Journal of Innovative Research in Computer Scien & Technology (IJIRCST) Regression and Correlation Analysis of Different Interestingness Measures for Mining Association Rules Mir Md Jahangir

More information

Principles of AI Planning

Principles of AI Planning Principles of AI Planning 5. Planning as search: progression and regression Albert-Ludwigs-Universität Freiburg Bernhard Nebel and Robert Mattmüller October 30th, 2013 Introduction Classification Planning

More information

Principles of AI Planning

Principles of AI Planning Principles of 5. Planning as search: progression and regression Malte Helmert and Bernhard Nebel Albert-Ludwigs-Universität Freiburg May 4th, 2010 Planning as (classical) search Introduction Classification

More information

Investigating Measures of Association by Graphs and Tables of Critical Frequencies

Investigating Measures of Association by Graphs and Tables of Critical Frequencies Investigating Measures of Association by Graphs Investigating and Tables Measures of Critical of Association Frequencies by Graphs and Tables of Critical Frequencies Martin Ralbovský, Jan Rauch University

More information

CPDA Based Fuzzy Association Rules for Learning Achievement Mining

CPDA Based Fuzzy Association Rules for Learning Achievement Mining 2009 International Conference on Machine Learning and Computing IPCSIT vol.3 (2011) (2011) IACSIT Press, Singapore CPDA Based Fuzzy Association Rules for Learning Achievement Mining Jr-Shian Chen 1, Hung-Lieh

More information

Mining Literal Correlation Rules from Itemsets

Mining Literal Correlation Rules from Itemsets Mining Literal Correlation Rules from Itemsets Alain Casali, Christian Ernst To cite this version: Alain Casali, Christian Ernst. Mining Literal Correlation Rules from Itemsets. IMMM 2011 : The First International

More information

Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees

Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Tomasz Maszczyk and W lodzis law Duch Department of Informatics, Nicolaus Copernicus University Grudzi adzka 5, 87-100 Toruń, Poland

More information

FUZZY ASSOCIATION RULES: A TWO-SIDED APPROACH

FUZZY ASSOCIATION RULES: A TWO-SIDED APPROACH FUZZY ASSOCIATION RULES: A TWO-SIDED APPROACH M. De Cock C. Cornelis E. E. Kerre Dept. of Applied Mathematics and Computer Science Ghent University, Krijgslaan 281 (S9), B-9000 Gent, Belgium phone: +32

More information

Chapter 3 Deterministic planning

Chapter 3 Deterministic planning Chapter 3 Deterministic planning In this chapter we describe a number of algorithms for solving the historically most important and most basic type of planning problem. Two rather strong simplifying assumptions

More information

On the robustness of association rules

On the robustness of association rules On the robustness of association rules Philippe Lenca, Benoît Vaillant, Stéphane Lallich GET/ENST Bretagne CNRS UMR 2872 TAMCIC Technopôle de Brest Iroise CS 8388, 29238 Brest Cedex, France Email: philippe.lenca@enst-bretagne.fr

More information

Mining Interesting Infrequent and Frequent Itemsets Based on Minimum Correlation Strength

Mining Interesting Infrequent and Frequent Itemsets Based on Minimum Correlation Strength Mining Interesting Infrequent and Frequent Itemsets Based on Minimum Correlation Strength Xiangjun Dong School of Information, Shandong Polytechnic University Jinan 250353, China dongxiangjun@gmail.com

More information

CS-C Data Science Chapter 8: Discrete methods for analyzing large binary datasets

CS-C Data Science Chapter 8: Discrete methods for analyzing large binary datasets CS-C3160 - Data Science Chapter 8: Discrete methods for analyzing large binary datasets Jaakko Hollmén, Department of Computer Science 30.10.2017-18.12.2017 1 Rest of the course In the first part of the

More information

Data Mining and Knowledge Discovery. Petra Kralj Novak. 2011/11/29

Data Mining and Knowledge Discovery. Petra Kralj Novak. 2011/11/29 Data Mining and Knowledge Discovery Petra Kralj Novak Petra.Kralj.Novak@ijs.si 2011/11/29 1 Practice plan 2011/11/08: Predictive data mining 1 Decision trees Evaluating classifiers 1: separate test set,

More information

Machine Learning: Pattern Mining

Machine Learning: Pattern Mining Machine Learning: Pattern Mining Information Systems and Machine Learning Lab (ISMLL) University of Hildesheim Wintersemester 2007 / 2008 Pattern Mining Overview Itemsets Task Naive Algorithm Apriori Algorithm

More information

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING Santiago Ontañón so367@drexel.edu Summary so far: Rational Agents Problem Solving Systematic Search: Uninformed Informed Local Search Adversarial Search

More information

Frequent Pattern Mining: Exercises

Frequent Pattern Mining: Exercises Frequent Pattern Mining: Exercises Christian Borgelt School of Computer Science tto-von-guericke-university of Magdeburg Universitätsplatz 2, 39106 Magdeburg, Germany christian@borgelt.net http://www.borgelt.net/

More information

Constraint-based Subspace Clustering

Constraint-based Subspace Clustering Constraint-based Subspace Clustering Elisa Fromont 1, Adriana Prado 2 and Céline Robardet 1 1 Université de Lyon, France 2 Universiteit Antwerpen, Belgium Thursday, April 30 Traditional Clustering Partitions

More information

Pattern Structures 1

Pattern Structures 1 Pattern Structures 1 Pattern Structures Models describe whole or a large part of the data Pattern characterizes some local aspect of the data Pattern is a predicate that returns true for those objects

More information

Learning Decision Trees

Learning Decision Trees Learning Decision Trees CS194-10 Fall 2011 Lecture 8 CS194-10 Fall 2011 Lecture 8 1 Outline Decision tree models Tree construction Tree pruning Continuous input features CS194-10 Fall 2011 Lecture 8 2

More information

From inductive inference to machine learning

From inductive inference to machine learning From inductive inference to machine learning ADAPTED FROM AIMA SLIDES Russel&Norvig:Artificial Intelligence: a modern approach AIMA: Inductive inference AIMA: Inductive inference 1 Outline Bayesian inferences

More information

Approximating a Collection of Frequent Sets

Approximating a Collection of Frequent Sets Approximating a Collection of Frequent Sets Foto Afrati National Technical University of Athens Greece afrati@softlab.ece.ntua.gr Aristides Gionis HIIT Basic Research Unit Dept. of Computer Science University

More information

Levelwise Search and Borders of Theories in Knowledge Discovery

Levelwise Search and Borders of Theories in Knowledge Discovery Data Mining and Knowledge Discovery 1, 241 258 (1997) c 1997 Kluwer Academic Publishers. Manufactured in The Netherlands. Levelwise Search and Borders of Theories in Knowledge Discovery HEIKKI MANNILA

More information