Efficient discovery of statistically significant association rules
|
|
- Constance Ward
- 6 years ago
- Views:
Transcription
1 fficient discovery of statistically significant association rules Wilhelmiina Hämäläinen Department of Computer Science University of Helsinki Finland Matti Nykänen Department of Computer Science University of Kuopio Finland Abstract Searching statistically significant association rules is an important but neglected problem. Traditional association rules do not capture the idea of statistical dependence and the resulting rules can be spurious, while the most significant rules may be missing. This leads to erroneous models and predictions which often become expensive. The problem is computationally very difficult, because the significance is not a monotonic property. However, in this paper we prove several other properties, which can be used for pruning the search space. The properties are implemented in the StatApriori algorithm, which searches statistically significant, non-redundant association rules. Based on both theoretical and empirical observations, the resulting rules are very accurate compared to traditional association rules. In addition, StatApriori can work with extremely low frequencies, thus finding new interesting rules. 1. Introduction Traditional association rules [1] are rules of form if event X = x occurs, then also event A = a is likely to occur. The commonness of the rule is measured by frequency P (X = x, A = a) and the strength of the rule by confidence P (A = a X = x). For computational purposes it is required that both frequency and confidence should exceed some user-defined thresholds. The actual interestingness of the rule is usually decided afterwards, by some interestingness measure. Often the associations are interpreted as correlations or dependencies between certain attribute value combinations. However, traditional association rules do not necessarily capture statistical dependencies, but they can associate absolutely independent events while ignoring strong dependencies. As a solution, it is often suggested (following the axioms by Piatetsky-Shapiro [18]) to measure the lift (interest) instead of the confidence (e.g. [21]). This produces also statistically more sound results, but still it is possible to find spurious rules while missing statistically significant rules. Often, these two error types are called type 1 and type 2 errors (in computer science terms, false positive and false negative). In the worst case, all discovered rules can be spurious [23, 24]. In practice, this means that the future data does not exhibit the discovered dependencies and the conclusions based on them are erroneous. The results can be expensive or even fatal, as the following example demonstrates. xample 1. A biological database contains observation reports from different kinds of biotopes, like grove, marsh, waterside, coniferous forest, etc. For association analysis, each report is represented as a binary vector, listing the observed species, along with biotope characteristics. Local forestry societies as well as individual land owners can use the data when they decide e.g. fellings or protected sites. The forestry society FallAll is going to drain swamps for new forests. Before any decisions are made, they search associations from the 1000 observations on marsh sides. They use minimum frequency 0.05 and minimum confidence One discovered rule is leather leaf cloudberry with frequency 0.06 and confidence Since cloudberries are commercially important product, the forestry society decides to protect a marsh growing leather leaves, when other swamps are drained. The decision is excellent for the leather leaf, but all cloudberries in the area disappear. The reason is that cloudberries require a wet swamp, while leather leaves can grow in both moist and wet sides. The only protected swamp in the area was too dry for cloudberries. This catastrophe was due to a spurious rule leather leaf cloudberry. The rule has p-value 0.13 which means that there 13% probability to make a type 1 error. In the same time, the forest society misses an important rule, namely wet swamp,leather leaf cloudberry. This rule was not found, because it had too low frequency, However, it is a strong rule with confidence 1.0. The p-
2 value is which indicates that the rule is quite reliable. Roughly speaking, it means that there is only 1.1% probability that the rule is spurious. The problems of association rules and especially the frequency-confidence-framework are well-known ([23, 24, 6, 16]), but still there have been only few attempts to solve the problem. Quite likely, the reason is purely practical: the problem has been considered computationally intractable. Statistical significance is not a monotonic property and therefore it cannot be used for pruning the search space in the same manner as the frequency. However, when we search directly statistically significant rules (instead of sets), we can utilize other properties for efficient pruning. More efficiency is achieved by searching only minimal (non-redundant) statistically significant rules. Such rules are at least as good as the pruned rules, but simpler, and no information is lost. In practice, the simpler rules avoid overfitting and hold better in the future data. In this paper, we introduce a set of properties which can be used for searching minimal, statistically most significant association rules. The properties are implemented in the StatApriori algorithm. StatApriori guarantees that no significant rules are missed, while the number of spurious rule candidates generated during the execution is kept minimal. Compared to a modification of the classical Apriori algorithm, which also produces all significant association rules, StatApriori is very efficient. It can tackle problems which are impossible to compute with the classical Apriori. As far as we know, the algorithm is the first of its kind. The previous algorithms have restricted in statistically significant classification rules using χ 2 -measure (e.g. [15, 16, 4, 22, 17]). This is quite a different problem, because in classification both X C and X C should be accurate. In other words, classification rules describe dependencies between attributes, while association rules describe dependencies between events. Webb [24] has done pioneering work in testing the statistical significance of association rules by Fisher s exact test. Fisher s exact test like χ 2 is designed for measuring dependence between attributes and some significant association rules can be missed (type 2 error). However, no new algorithms were introduced in these experiments and the search proved to be infeasible on large data sets with the existing techniques. In addition, several interestingness measures (see e.g. [9] for an overview) have their origins in statistics, but they do not measure the statistical significance of association rules. As a result, they can cause both type 1 and type 2 errors. The organization of the paper is the following: The basic definitions are given in Section 2. The main principles of the search are given in Section 3 and the StatApriori algorithm in Section 4. xperimental results are reported in Section 5 and the final conclusions are drawn in Section Basic definitions In the following we give basic definitions of the association rule, statistical dependence, statistical significance, and redundancy. The notations are introduced in Table 1. Table 1. Basic notations. Notation Meaning A, B, C,... binary attributes a, b, c,... {0, 1} attribute values R = {A 1,..., A k } set of all attributes R = k number of attributes Dom(R) = {0, 1} k attribute space X, Y, Z R attribute sets Dom(X) = {0, 1} l domain of X, X = l (X = x) = {(A 1 = a 1 ),..., event, X = l (A l = a l )} t = {A 1 = t(a 1 ),..., A k = t(a k )} row (tuple, transaction) r = {t 1,..., t n t i Dom(R)} relation (data set) r = n size of relation r σ X=x (r) = {t r t[x] = x} set of rows where X = x m(x = x) = σ X=x (r) number of rows where X = x P (X = x) = m(x=x) relative frequency of n X = x i(fr, γ) measure functions I(X A) = i(p (XA), γ(x, A)) upperbound(f) an upperbound for function f bestrule(x) = the best rule which can be arg max A X {I(X \ A A)} constructed from X P S(X) property potentially significant ; whether significant rules can be derived from X or its supersets minattr(x) = minimum attribute of X; arg min{p (A i ) A i X} one with the lowest frequency 2.1. Association rules Traditionally, association rules are defined in the frequency-confidence framework: Definition 1 (Association rule). Let R be a set of binary attributes and r a relation according to R. Let X R, A R \ X, x Dom(X), and a Dom(A). The confidence of rule (X = x) (A = a) is cf(x = x A = a) = P (X = x, A = a) P (X = x) = P (A = a X = x) and the frequency (support) of the rule is fr(x = x A = a) = P (X = x, A = a). Given user-defined thresholds min cf, min fr [0, 1], rule (X = x) (A = a) is an association rule in r, if 2
3 (i) cf(x = x A = a) min cf, and (ii) fr(x = x A = a) min fr. The first condition requires that an association rule should be strong enough and the second condition requires that it should be common enough. In this paper, we call rules association rules, even if no thresholds min fr and min cf are specified. Usually, it is assumed that the rule contains only positive attribute values (A i = 1). Now, the rule can be expressed simply by listing the attributes, e.g. A 1, A 3, A 5 A Statistical dependence Statistical dependence is usually defined through statistical independence (e.g. [20, 14]): Definition 2 (Independence and dependence). Let X R and A R \ X be sets of binary attributes. vents X = x and A = a, x Dom(X), a Dom(A), are mutually independent, if P (X = x, A = a) = P (X = x)p (A = a). If the events are not independent, they are dependent. The strength of the statistical dependence between (X = x) and (A = a) can be measured by lift or interest: γ(x = x, A = a) = P (X = x, A = a) P (X = x)p (A = a). In the following, we will concentrate on the dependencies between events containing only positive attributes. The lift of rule X A is denoted simply γ(x, A) Statistical significance In this work, we analyze the statistical significance of association rules in the classical (frequentist) framework. Bayesian significance testing offers an interesting alternative, but it is still little studied in this context. Both approaches produce asymptotically similar results (under some assumptions, the test results are identical), although the Bayesian testing is sensitive to the selected prior probabilities. [3] The idea of classical statistical significance tests (see e.g. [8, Ch. 26] or [12, Ch. 10.1]) is to estimate the probability of the observed or a rarer phenomenon, under some null hypothesis. When the objective is to test the significance of the dependency between events X and A, the null hypothesis is the independence assumption: P (X, A) = P (X)P (A). The task is to calculate the probability p that the observed or higher frequency occurs in the data, if the events were actually independent. If the estimated probability p is very small, then the observed dependency is said to be significant at level p. The significance of the observed frequency m(x, A) can be estimated exactly by the binomial distribution. ach row in relation r, r = n, corresponds to an independent Bernoulli trial, whose outcome is either 1 (XA occurs) or 0 (XA does not occur). All rows are mutually independent. Assuming the independence of attributes X and A, combination XA occurs on a row with probability P (X)P (A). Now the number of rows containing X, A is a binomial random variable M with parameters P (X)P (A) and n. The mean of M is µ M = np (X)P (A) and its variance is σm 2 = np (X)P (A)(1 P (X)P (A)). Probability P (M m(x, A)) gives the significance p: p = m(x) i=m(x,a) ( n i ) (P (X)P (A)) i (1 P (X)P (A)) n i. The significance can be used in two ways to prune association rules: either 1) we can set the significance level (maximum p) and search all rules with sufficiently low p, or 2) use the p-values to search the K most significant rules. Deciding the required significance level is a difficult problem, which we do not try to solve here. The problem is that the more rules we test, the more spurious rules are likely to pass the significance test. Webb [24] has suggested a solution in the context of association rule discovery using the Bonferroni adjustment [19] The z-score The binomial probability is quite difficult to calculate, but for our purposes it is enough to have an upper bound for the p-value. This guarantees that no rules with a low p- value are lost when the search space is pruned. Additional pruning and ranking can be done afterwards, when the actual binomial probabilities are calculated. The simplest upper bound is based on the (binomial) z- score: z(x, A) = m(x, A) µ M = = σ M m(x, A) np (X)P (A) np (X)P (A)(1 P (X)P (A)) n(γ(x, A) 1) γ(x, A) P (X, A). The z-score measures how many standard deviations (σ M ) the observed frequency m(x, A) deviates from the expected value µ M = np (X)P (A). The corresponding probability can be easily approximated, because z follows the standard normal distribution, when n is sufficiently large and P (X)P (A) (or 1 P (X)P (A)) is neither close to 0 nor to 1. As a rule of thumb, the approximation can be used, 3
4 when np (X)P (A) 5 (e.g. [12, 147]). According to [7], the approximation works well even for np (X)P (A) 2, if continuity correction (subtracting 0.5 from m(x, A)) is used. When P (X)P (A) is low, the binomial distribution is positively skewed. This means that the z-score overestimates the significance. Therefore, we will not use normal approximation to estimate the p-values, but the z-score is used only as a measure function. We note that the z-score is not crucial to our method, but several other measure functions can be used, as well. The requirement is that the measure I is a monotonically increasing or decreasing function of m(x, A) and γ(x, A). For example, when the expected value P (X)P (A) is very low, we can derive a tight upperbound for p from the Chernoff bound [10]: e δµ M P (M > µ M (1 + δ) <. (1 + δ) (1+δ)µ M By inserting δ = γ 1, where γ = γ(x, A), and using γµ = m(x, A), we achieve p ch = P (M > m(x, A)) < (e (γ 1) γ γ ) m(x,a). This is monotonically decreasing with both m(x, A) and γ Redundancy A common goal in association rule discovery is to find minimal (or most general) interesting rules, and prune out redundant rules [5]. The reasons are twofold: First, the number of discovered rules is typically too large (even hundreds of thousand rules) for any human interpreter. According to the Occam s Razor principle, it is only sensible to prune out complex rules X A, if their generalizations Z A, Z X are at least equally interesting. The user just has to define the interestingness measure carefully, according to the modelling purposes. Second, pruning redundant rules can save the search time enormously, if it is done on-line. This is not possible with many interestingness functions, and usually the pruning is done afterwards. In our case, the interestingness measure is the statistical significance, but in general, redundancy and minimality can be defined with respect to any other measure function. Definition 3 (Redundant rules). Given an increasing interestingness measure I, rule X A is redundant, if there exists rule X A such that X {A } X {A} and I(X A ) I(X A). If the rule is not redundant, then it is called non-redundant. I.e. a rule is non-redundant, if all its generalizations ( parent rules ) are less significant. It is still possible that some or all of its specializations ( children rules ) are better. In the latter case, the rule is unlikely interesting itself. Non-redundant rules can be further classified as minimal or non-minimal: Definition 4 (Minimal rules). Non-redundant rule X A is minimal, if for all rules X A, such that X {A } X {A}, I(X A) I(X A ). I.e. a minimal rule is more significant than any of its parent or children rules. In the algorithmic level this means that we stop the search without checking any children rules, if we have just ensured that the rule is minimal. 3. Main principles In this section, we will introduce the main principles of the search algorithm. The results are given on such a general level that any suitable measure function or search strategy can be applied Problem definition Let us first define the problem formally: Definition 5 (Search problem). Let p(x A) denote the p-value of rule X A. Given binary data r and threshold p max R, the problem is to search a set of association rules S such that for all X A S 1. X A expresses a positive correlation, i.e. γ(x A) > 1, 2. X A is non-redundant, 3. for all Y B / S, p(x A) p(y B), and 4. p(x A) p max. We note that the user has to select only one parameter, p max. Alternatively, we could define an optimization problem, where the N best rules (with lowest p-values) are searched. Let us now assume that we have a measure function i(f r, γ) which defines an upperbound for the binomial probability. For any rule X A, I(X A) = i(p (XA), γ(x, A)). In addition, let i be either monotonically increasing or decreasing with both frequency f r and lift γ. The search problem can be divided into two subproblems: 1. Search all non-redundant rules X A for which p(x A) p max using i. 4
5 Figure 1. The general Apriori algorithm. Input: set of attributes R, data set r, an anti-monotonic property π Output: {X P(R) π(x) = 1} Method: // Initialization S 1 = {A i R π(a i ) = 1} l = 1 while (S l ) // Step 1: Candidate generation Generate C l+1 from S l // Step 2: Pruning S l+1 = {c C l+1 π(c) = 1} l = l + 1 return l S l 2. Calculate the exact p-values and output rules with sufficiently low p. The postprocessing step is trivial and we will concentrate on only the search step. For simplicity, we assume that i is monotonically increasing; a monotonically decreasing measure function is handled similarly. The measure function I guarantees that all significant rules at level p max are discovered. For efficiency, i should also prune out as many insignificant or redundant rules as possible Monotonic and anti-monotonic properties The key idea of the classical Apriori algorithm [2, 13] is the anti-monotonicity of frequency. For attribute sets, the monotonicity and anti-monotonicity are defined as follows: Definition 6 (Monotonic and anti-monotonic properties). Property π : P(R) {0, 1} is monotonic, if (π(y ) = 1) (π(x) = 1) for all X Y, and anti-monotonic, if (π(x) = 1) (π(y ) = 1) for all Y X. When π is anti-monotonic (π(y ) = 0) (π(x) = 0) for all X Y. When the measure function defines an anti-monotonic property, the interesting sets or rules can be searched with the general Apriori algorithm (Figure 1). The problem is that the measure functions for the statistical significance do not define any anti-monotonic property. However, it turns out that the upper-bound for the measure function I defines an anti-monotonic property for most set-inclusion relations Property P S Let us define the property P S, potentially significant. Potential significance of set X is a necessary condition for constructing any significant rule X \ A A. Definition 7. Let measure function I be as before, min I a user-defined threshold, and upperbound(f) an upperbound for function f. Let bestrule(x) = arg max A X {I(X \ A A)} be the best rule which can be constructed from attributes X. Property P S : P(R) {0, 1} is defined as P S(X) = 1, iff upperbound(i(bestrule(x))) min I. Now it is enough to define the conditions under which P S behaves anti-monotonically. The following theorem is the core of the whole search algorithm: Theorem 1. Let P S, X, and Y be as before. If (P S(X) = 1), then (P S(Y ) = 1) for all Y X such that minattr(x) = minattr(y ). Proof. First observe that for all A X we have γ(x \ A, A)) 1 P (A) 1 P (minattr(x)) and upperbound(i(x \ 1 A A)) = i(p (X), P (minattr(x)) ). By them we have upperbound(i(bestrule(x))) = 1 i(p (X), P (minattr(x)) ) i(p (Y ), 1 P (minattr(y )) ) = upperbound(i(bestrule(y ))) for all Y X such that minattr(x) = minattr(y ). We have min I upperbound(i(bestrule(x))) by the definition of P S(X) = 1. Hence the reasoning above yields also min I upperbound(i(bestrule(y ))), as required for the definition of P S(Y ) = 1. Corollary 1. If P S(Y ) = 0, then P S(X) = 0 for all X Y such that minattr(x) = minattr(y ). We have shown that property P S defines an antimonotonic property among sets having the same minimum attribute. Let us now consider the exceptional case, when the anti-monotonicity does not hold. Let us call X an l-set, if X = l. Let the (l 1)-subsets, Y i X, Y i = l 1, be called X s parent sets. Now X has l parent sets, from which only one has a different minimum attribute. The exceptional parent set is Y l = X \ {minattr(x)}. If P (minattr(y l )) > P (minattr(x)), Y l has a lower upperbound for γ than Y 1,..., Y l 1 and X have. Therefore, it is possible that Y l is non-p S, even if X is P S. 4. Algorithm Next, we give the StatApriori algorithm, which implements the pruning properties. 5
6 4.1. The main idea The StatApriori algorithm proceeds like the general Apriori (Figure 1) using property P S. It alternates between the candidate generation and pruning steps, as long as new non-redundant, potentially significant rules can be found. However, special techniques are needed, because property P S is not anti-monotonic in all respects. First, the attributes are arranged into an ascending order by their frequencies. Let the renamed attributes be {A 1,..., A k }, where P (A 1 )... P (A k ). The idea is that the candidates are generated in the canonical order. From l-set X = {A 1,..., A l }, we can generate (l + 1)-sets X {A j }, where j > l. Now all supersets of X have the same upper-bound for the lift, γ 1 P (A 1 ). If X is non-p S, then none of its descendants can be P S. Otherwise, we should check the other parent sets Z X {A j }, Z = l. If at most one of them, (X \ {A 1 }) {A j }, is non-p S, then X {A j } is added to the candidate collection C l+1. If (X \ {A 1 }) {A j } was non-p S, it is also added to a temporary collection of special parents for frequency counting. After candidate generation, the exact frequencies are counted from the data. Candidates which are non-p S or can produce only redundant descendants, will be pruned, and others are added to collection S l+1. The minimality of P S rules is checked, because no new candidates are generated from minimal rules. The principles for redundancy and minimality checking are 1. If we have γ(bestrule(x)) = 1 P (minattr(x)), then the lift is already maximal possible, and none of X s specializations can gain a better p-value. The rule is marked as minimal. 2. If we have upperbound(i(bestrule(x {A j }))) i(bestrule(z)) for some attributes Z X {A j }, then X {A j } and all its specializations will be redundant with respect to Z. X {A j } is removed numeration tree The secret of StatApriori is a special kind of enumeration tree, which enables an efficient implementation of pruning principles. A complete enumeration tree lists all sets in the powerset P(R). In practice, it can be implemented as a trie, where each root node path corresponds an item set. StatApriori uses an ordered enumeration tree, where the attributes are arranged into ascending order by their frequencies. Figure 2) shows an example, when R = {A, B, C, D, }, and P (A) P (B)... P (). Solid lines represent an example tree, when two levels are generated, and dash lines show the nodes missing from a complete tree. Let us now consider the candidate generation at the third level. The missing nodes at the second level are either insignificant or they and all their descendants are redundant. Set {A, B, C} is added to the tree, because all its parent sets, {A, B}, {A, C}, and {B, C} are in the tree (i.e. P S) and non-minimal. Sets {A, B, D} and {A, B, } are not added, because they have non-p S parents (missing sets {A, D} and {A, }) with the same minimum attribute A. The same holds for {A, C, D} and {A, C, }. However, {B, C, D} is added to the tree, because the only missing parent, {C, D}, has a different minimum attribute. This is a special case, and also {C, D} is added to a temporary collection for frequency counting. Sets {B, C, } and {B, D, } are not added, because {B, } is missing. After frequency counting, non-p S candidates will be removed from the tree. D C B C D D D D A Figure 2. A complete enumeration tree (dash line) and an example tree (solid line) Pseudocode The StatApriori algorithm is given in Figures 3, 4, and 5. In the pseudocode it is assumed that the measure function I for statistical significance is increasing Time complexity It is known that the problem of searching all frequent attribute sets is NP -hard, in the terms of the number of attributes, k [11]. The worst case happens, when the most significant association rule involves all k attributes, and all 2 k attribute sets are generated. The worst case complexity of the algorithm is O(max{k 2, nk}2 k ). Usually, when k < n, this reduces to O(n 2 2 k ). Theorem 2. The worst-case time complexity of StatApriori is O(max{k 2, nk}2 k ), where n is the number of rows and k is the number of attributes. C D B C D D 6
7 Figure 3. Algorithm StatApriori. Input: set of attributes R, data set r, increasing measure function I, threshold K Output: non-redundant rules which are significant Method: // Initialization 1 S 1 = {A i R P S(A i ) = 1} 2 l = 1 3 while (S l ) 4 // Step 1: Candidate generation 5 C l+1 =GenCands(S l, l) 6 // Step 2: Pruning 7 S l+1 =PruneCands(C l+1, l + 1, K) 8 l = l for all X i l S l such that ( X i.redundant) and (X i.max I K) 10 output bestrule(x i ) Figure 5. Algorithm PruneCands. Input: l-candidates C l, size l, threshold K Output: potentially significant l-sets S l Method: 1 S l = ; 2 for all X i C l, Y j SpecP ar 3 count frequencies P (X i ) and P (Y j ) from r 4 for all X i C l 5 max γ = P (minattr(x)) 1 // check if X i is P S and its descendants can // produce non-redundant rules 6 if (PS(P (X i ), max γ, K) and ( Redundant(P (X i ), max γ )) 7 add X i to S l 8 if i(bestrule(x i )) == upperbound(i(bestrule(x i ))) 9 X i.minimal = 1 // check if X i is redundant; its descendants can // still be non-redundant 10 if (Redundant(P (X i ), γ(bestrule(x i )))) 11 X i.redundant = 1 12 X i.max I = max{y j.max I Y j P arents(x i )} 13 return S l ; Figure 4. Algorithm GenCands. Input: potentially significant l-sets S l, size l Output: (l + 1)-candidates C l+1, special parents SpecP ar Method: 1 C l+1 = ; SpecP ar = ; // X i and X j have l 1 common attributes, the same // minimum attribute and none of them is minimal 2 for all X i, X j S l such that (( X i X j == l 1) and Minimal(X i ) and Minimal(X j )) // check the other parents with the same minattr 3 if Z X i X j such that (( Z = l) and (minattr(z) == minattr(x i )) if ( Minimal(Z) and Z S l 1 ) 4 add X i X j to C l+1 // check special case 5 if ((X i X j ) \ minattr(x i ) / S l ) 6 add (X i X j ) \ minattr(x i ) to SpecP ar 7 return (C l+1, SpecP ar) Proof. The initialization (generation of 1-sets) takes n k steps. Producing l-sets and their best rules takes l 2 C l + 2n C l k + 2 S l l time steps. The first term is the time complexity of the candidate generation. ach candidate has l parents and each parent can be found in the trie in l 1 steps. The second term is the complexity of the frequency counting. The database is read (n rows) and on each row C l candidates are checked. In the worst case, all candidates have an extra parent which has to be checked, too. Checking takes in the worst case log k steps, when the data is stored as bit vectors and inclusion test is implemented with logical bit operations. The third term is the complexity of the rule selection phase: for each of S l sets, all l parents are checked. Checking is done at most twice: once for calculating the maximal I-value (selecting the best rule) and second time for checking the redundancy. ach checking can be implemented in constant time, if the parent pointers are stored into a temporary structure in candidate generation phase. Since C l S l, the total complexity is k max{l 2, n} C l < max{k 2, n} l=2 k l=2 ( ) k l = O(max{k 2, nk}2 k ). 7
8 5. xperiments The main goal of the experiments was to evaluate the speed accuracy ratio of the StatApriori algorithm. ven a clever algorithm is worthless unless it can produce better results or perform faster than the existing methods. It was expected that StatApriori cannot compete in speed with the traditional methods, but instead it is likely to produce more accurate rules. The data sets and test parameters are described in Table 2. The first four data sets are classical benchmark data sets from FIMI repository for frequent item set mining (http: //fimi.cs.helsinki.fi/). Plants lists all plant species growing in the U.S.A. and Canada. ach row contains a species and its distribution (states where it grows). The data has been extracted from the USDA plants database ( Garden lists recommended plant combinations. The data is extracted from several gardening sources (e.g. baygardens.tripod.com/). ach data set was tested with two minimum confidences, 0.90 and The goal was to find both strong (and probably accurate) rules and strong correlations. For all tested measures we have calculated the average prediction accuracy (error in the test set) and lift among 100 best rules during 10 executions. All experiments were executed on Intel Core Duo processor T GHz with 1 GB RAM, 2MG cache, and Linux operating system. The quality of rules is summarized in Table 3. In Stat- Apriori, the main measure function was the z-score, but the binomial probabilities (p-values) were used for redundancy reduction, too. A rule was considered redundant, if it had either a lower z-score than its parent rules or if the lower bound of log(p) was higher than the minimum upperbound for parents log(p). This strategy proved to be efficient when the frequencies become low and z-scores expand. For comparison, the rules we selected with the χ 2 - measure, J-measure, z-score and frequency, after normal frequency-based pruning. StatApriori produced very accurate results on all data sets, except Chess and Garden. The latter was difficult for all measures, because all patterns are very rare. For proper analysis, an ontology of Genus Species Subspecies Variety should be used. The poor behaviour on Chess is harder to explain. For all other measures the rules were selected with exceptionally high minimum frequency (min fr = 0.75). This means that the consequence holds on at least 75% of rows and the error is less than 25%, even if the antecedent is empty. In fact, the rules did not represent any correlations, but the consequents were totally independent from the antecedents. In all cases, StatApriori produced the strongest lift. This is understandable, because the statistical significance measures the correlations. When the z-score was used with the minimum frequency thresholds, the lift values were much smaller. The accuracy was also poorer, which suggests that the z-score suffers for frequency-based pruning. Quite likely, the same holds for the chi 2. It is noteworthy that the StatApriori performed faster than the traditional Apriori in all test cases, even if no minimum frequency thresholds were used. The maximum execution time was for Chess, 110s. The large minimum frequencies for the Apriori are partly due to heavy postprocessing. For feasibility, the thresholds were set to avoid over rules. However, the dense data sets are difficult for Apriori even without this restriction. For example, Apriori cannot handle Chess with min fr < Conclusions Searching statistically significant association rules is an important but neglected problem. So far, it has been considered computationally infeasible with any larger data sets. In this paper, we have shown that its is possible to search all statistically significant rules in a reasonable time. We have introduced a set of effective pruning properties and a breadth-first search strategy, StatApriori, which implements them. StatApriori can be used in two ways: either to search K most significant association rules or all rules passing the given significance threshold (minimum z-score). This enables the user to solve the multiple testing problem (i.e. setting the significance threshold) in a desired way or use the algorithm only for ranking the most significant rules. In the same time, StatApriori solves another important problem, and prunes out all redundant association rules. According to experimental results, this improves the rule quality by avoiding overfitting. Together, the z-score and redundancy reduction provide a robust method for rule discovery. I.e. the discovered rules have a high probability to hold in the future data. As far as we know, this is the first algorithm of its kind. The few existing algorithms have either searched only classification rules with statistical measures or used frequencybased pruning in some extent. Both of these strategies are likely to lose significant association rules. In the future research, we are going to improve the efficiency further, using a suitable indexing structure or additional pruning criteria. The final goal is to develop an efficient algorithm for searching the most significant, general association rules, containing propositional logic formulas. 7. Acknowledgments We thank Finnish Concordia Fund ( konkordia-liitto.com/) for supporting this re- 8
9 search. References [1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In P. Buneman and S. Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages , Washington, D.C., [2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB 94, pages Morgan Kaufmann, [3] A. Agresti and Y. Min. Frequentist performance of bayesian confidence intervals for comparing proportions in 2 2 contingency tables. Biometrics, 61:515523, [4]. Baralis and P. Garza. A lazy approach to pruning classification rules. In Proceedings of the 2002 I International Conference on Data Mining (ICDM 02), page 35. I Computer Society, [5] Y. Bastide, N. Pasquier, R. Taouil, G. Stumme, and L. Lakhal. Mining minimal non-redundant association rules using frequent closed itemsets. In Proceedings of the First International Conference on Computational Logic (CL 00), volume 1861 of Lecture Notes in Computer Science, pages Springer-Verlag, [6] F. Berzal, I. Blanco, D. Sánchez, and M. A. V. Miranda. A new framework to assess association rules. In Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis (IDA 01), volume 2189 of Lecture Notes In Computer Science, pages , London, UK, Springer-Verlag. [7] K. Carriere. How good is a normal approximation for rates and proportions of low incidence events? Communications in Statistics: Simulation and Computation, 30: , [8] D. Freedman, R. Pisani, and R. Purves. Statistics. Norton & Company, London, 4th edition, [9] L. Geng and H. J. Hamilton. Interestingness measures for data mining: A survey. ACM Computing Surveys, 38(3):9, [10] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:1330, [11] C. Jermaine. Finding the most interesting correlations in a database: how hard can it be? Information Systems, 30(1):21 46, [12] B. Lindgren. Statistical Theory. Chapman & Hall, Boca Raton, U.S.A., 4th edition, [13] H. Mannila, H. Toivonen, and A. Verkamo. fficient algorithms for discovering association rules. In Papers from the AAAI Workshop on Knowledge Discovery in Databases (KDD 94), pages AAAI Press, [14] R. Meo. Theory of dependence values. ACM Transactions on Database Systems, 25(3): , [15] S. Morishita and A. Nakaya. Parallel branch-and-bound graph search for correlated association rules. In Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD, volume 1759 of Lecture Notes in Computer Science, pages Springer-Verlag, [16] S. Morishita and J. Sese. Transversing itemset lattices with statistical metric pruning. In Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS 00), pages ACM Press, [17] S. Nijssen and J. Kok. Multi-class correlated pattern mining. In Proceedings of the 4th International Workshop on Knowledge Discovery in Inductive Databases, volume 3933 of Lecture Notes in Computer Science. Springer-Verlag, [18] G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro and W. Frawley, editors, Knowledge Discovery in Databases, pages AAAI/MIT Press, [19] J. Shaffer. Multiple hypothesis testing. Annual Review of Psychology, 46: , [20] C. Silverstein, S. Brin, and R. Motwani. Beyond market baskets: Generalizing association rules to dependence rules. Data Mining and Knowledge Discovery, 2(1):39 68, [21] P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the right objective measure for association analysis. Information Systems, 29(4): , [22] H. Toivonen, P. Onkamo, K. Vasko, V. Ollikainen, P. Sevon, H. Mannila, M. Herr, and J. Kere. Data mining applied to linkage disequilibrium mapping. American Journal of Human Genetics, 67: , [23] G. Webb. Discovering significant rules. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD 06), pages , New York, USA, ACM Press. [24] G. I. Webb. Discovering significant patterns. Machine Learning, 68(1):1 33,
10 Table 2. Description of data sets and test parameters. Data StatApr Apriori n k min cf min Z min fr 1a Mushroom b Mushroom a Chess b Chess a T10I4D100K b T10I4D100K a T40I10D100K b T40I10D100K a Plants b Plants a Garden b Garden Table 3. Average rule accuracy and lift with different measure functions. StatApr Apriori test Z&p chi 2 J z fr γ err γ err γ err γ err γ err 1a b a b a b a b a b a b
Guaranteeing the Accuracy of Association Rules by Statistical Significance
Guaranteeing the Accuracy of Association Rules by Statistical Significance W. Hämäläinen Department of Computer Science, University of Helsinki, Finland Abstract. Association rules are a popular knowledge
More informationEfficient search of association rules with Fisher s exact test
fficient search of association rules with Fisher s exact test W. Hämäläinen Abstract limination of spurious and redundant rules is an important problem in the association rule discovery. Statistical measure
More informationEncyclopedia of Machine Learning Chapter Number Book CopyRight - Year 2010 Frequent Pattern. Given Name Hannu Family Name Toivonen
Book Title Encyclopedia of Machine Learning Chapter Number 00403 Book CopyRight - Year 2010 Title Frequent Pattern Author Particle Given Name Hannu Family Name Toivonen Suffix Email hannu.toivonen@cs.helsinki.fi
More informationOn Minimal Infrequent Itemset Mining
On Minimal Infrequent Itemset Mining David J. Haglin and Anna M. Manning Abstract A new algorithm for minimal infrequent itemset mining is presented. Potential applications of finding infrequent itemsets
More informationPositive Borders or Negative Borders: How to Make Lossless Generator Based Representations Concise
Positive Borders or Negative Borders: How to Make Lossless Generator Based Representations Concise Guimei Liu 1,2 Jinyan Li 1 Limsoon Wong 2 Wynne Hsu 2 1 Institute for Infocomm Research, Singapore 2 School
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Mining Frequent Patterns and Associations: Basic Concepts (Chapter 6) Huan Sun, CSE@The Ohio State University Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan
More informationChapter 6. Frequent Pattern Mining: Concepts and Apriori. Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining
Chapter 6. Frequent Pattern Mining: Concepts and Apriori Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining Pattern Discovery: Definition What are patterns? Patterns: A set of
More informationMining Positive and Negative Fuzzy Association Rules
Mining Positive and Negative Fuzzy Association Rules Peng Yan 1, Guoqing Chen 1, Chris Cornelis 2, Martine De Cock 2, and Etienne Kerre 2 1 School of Economics and Management, Tsinghua University, Beijing
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Mining Frequent Patterns and Associations: Basic Concepts (Chapter 6) Huan Sun, CSE@The Ohio State University 10/17/2017 Slides adapted from Prof. Jiawei Han @UIUC, Prof.
More informationRemoving trivial associations in association rule discovery
Removing trivial associations in association rule discovery Geoffrey I. Webb and Songmao Zhang School of Computing and Mathematics, Deakin University Geelong, Victoria 3217, Australia Abstract Association
More informationCS 584 Data Mining. Association Rule Mining 2
CS 584 Data Mining Association Rule Mining 2 Recall from last time: Frequent Itemset Generation Strategies Reduce the number of candidates (M) Complete search: M=2 d Use pruning techniques to reduce M
More informationLecture Notes for Chapter 6. Introduction to Data Mining
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004
More informationFree-sets : a Condensed Representation of Boolean Data for the Approximation of Frequency Queries
Free-sets : a Condensed Representation of Boolean Data for the Approximation of Frequency Queries To appear in Data Mining and Knowledge Discovery, an International Journal c Kluwer Academic Publishers
More informationA New Concise and Lossless Representation of Frequent Itemsets Using Generators and A Positive Border
A New Concise and Lossless Representation of Frequent Itemsets Using Generators and A Positive Border Guimei Liu a,b Jinyan Li a Limsoon Wong b a Institute for Infocomm Research, Singapore b School of
More informationEfficient discovery of the top-k optimal dependency rules with Fisher s exact test of significance
Efficient discovery of the top-k optimal dependency rules with Fisher s exact test of significance Wilhelmiina Hämäläinen epartment of omputer Science University of Helsinki Finland whamalai@cs.helsinki.fi
More informationEfficient discovery of statistically significant association rules
Efficient discovery of statistically significant association rules Wilhelmiina Hämäläinen Deartment of Comuter Science University of Helsinki Finland whamalai@cs.helsinki.fi ABSTRACT In this aer, we introduce
More informationDATA MINING - 1DL360
DATA MINING - 1DL36 Fall 212" An introductory class in data mining http://www.it.uu.se/edu/course/homepage/infoutv/ht12 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology, Uppsala
More informationData Mining Concepts & Techniques
Data Mining Concepts & Techniques Lecture No. 04 Association Analysis Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro
More informationD B M G Data Base and Data Mining Group of Politecnico di Torino
Data Base and Data Mining Group of Politecnico di Torino Politecnico di Torino Association rules Objective extraction of frequent correlations or pattern from a transactional database Tickets at a supermarket
More informationAssociation Rules. Fundamentals
Politecnico di Torino Politecnico di Torino 1 Association rules Objective extraction of frequent correlations or pattern from a transactional database Tickets at a supermarket counter Association rule
More informationD B M G. Association Rules. Fundamentals. Fundamentals. Elena Baralis, Silvia Chiusano. Politecnico di Torino 1. Definitions.
Definitions Data Base and Data Mining Group of Politecnico di Torino Politecnico di Torino Itemset is a set including one or more items Example: {Beer, Diapers} k-itemset is an itemset that contains k
More informationD B M G. Association Rules. Fundamentals. Fundamentals. Association rules. Association rule mining. Definitions. Rule quality metrics: example
Association rules Data Base and Data Mining Group of Politecnico di Torino Politecnico di Torino Objective extraction of frequent correlations or pattern from a transactional database Tickets at a supermarket
More informationCOMP 5331: Knowledge Discovery and Data Mining
COMP 5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Jiawei Han, Micheline Kamber, and Jian Pei And slides provide by Raymond
More information732A61/TDDD41 Data Mining - Clustering and Association Analysis
732A61/TDDD41 Data Mining - Clustering and Association Analysis Lecture 6: Association Analysis I Jose M. Peña IDA, Linköping University, Sweden 1/14 Outline Content Association Rules Frequent Itemsets
More informationDistributed Mining of Frequent Closed Itemsets: Some Preliminary Results
Distributed Mining of Frequent Closed Itemsets: Some Preliminary Results Claudio Lucchese Ca Foscari University of Venice clucches@dsi.unive.it Raffaele Perego ISTI-CNR of Pisa perego@isti.cnr.it Salvatore
More informationLars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
Syllabus Fri. 21.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 27.10. (2) A.1 Linear Regression Fri. 3.11. (3) A.2 Linear Classification Fri. 10.11. (4) A.3 Regularization
More informationMining Class-Dependent Rules Using the Concept of Generalization/Specialization Hierarchies
Mining Class-Dependent Rules Using the Concept of Generalization/Specialization Hierarchies Juliano Brito da Justa Neves 1 Marina Teresa Pires Vieira {juliano,marina}@dc.ufscar.br Computer Science Department
More informationFree-Sets: A Condensed Representation of Boolean Data for the Approximation of Frequency Queries
Data Mining and Knowledge Discovery, 7, 5 22, 2003 c 2003 Kluwer Academic Publishers. Manufactured in The Netherlands. Free-Sets: A Condensed Representation of Boolean Data for the Approximation of Frequency
More informationEfficient search for statistically significant dependency rules in binary data
Department of Computer Science Series of Publications A Report A-2010-2 Efficient search for statistically significant dependency rules in binary data Wilhelmiina Hämäläinen To be presented, with the permission
More informationData Mining: Concepts and Techniques. (3 rd ed.) Chapter 6
Data Mining: Concepts and Techniques (3 rd ed.) Chapter 6 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2013 Han, Kamber & Pei. All rights
More informationChapter 4: Frequent Itemsets and Association Rules
Chapter 4: Frequent Itemsets and Association Rules Jilles Vreeken Revision 1, November 9 th Notation clarified, Chi-square: clarified Revision 2, November 10 th details added of derivability example Revision
More informationDATA MINING - 1DL360
DATA MINING - DL360 Fall 200 An introductory class in data mining http://www.it.uu.se/edu/course/homepage/infoutv/ht0 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology, Uppsala
More informationΣοφια: how to make FCA polynomial?
Σοφια: how to make FCA polynomial? Aleksey Buzmakov 1,2, Sergei Kuznetsov 2, and Amedeo Napoli 1 1 LORIA (CNRS Inria NGE Université de Lorraine), Vandœuvre-lès-Nancy, France 2 National Research University
More informationBottom-Up Propositionalization
Bottom-Up Propositionalization Stefan Kramer 1 and Eibe Frank 2 1 Institute for Computer Science, Machine Learning Lab University Freiburg, Am Flughafen 17, D-79110 Freiburg/Br. skramer@informatik.uni-freiburg.de
More informationMining Approximative Descriptions of Sets Using Rough Sets
Mining Approximative Descriptions of Sets Using Rough Sets Dan A. Simovici University of Massachusetts Boston, Dept. of Computer Science, 100 Morrissey Blvd. Boston, Massachusetts, 02125 USA dsim@cs.umb.edu
More informationAssociation Rule. Lecturer: Dr. Bo Yuan. LOGO
Association Rule Lecturer: Dr. Bo Yuan LOGO E-mail: yuanb@sz.tsinghua.edu.cn Overview Frequent Itemsets Association Rules Sequential Patterns 2 A Real Example 3 Market-Based Problems Finding associations
More informationData Mining and Analysis: Fundamental Concepts and Algorithms
Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA
More informationDetecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules. M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D.
Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D. Sánchez 18th September 2013 Detecting Anom and Exc Behaviour on
More informationInduction of Decision Trees
Induction of Decision Trees Peter Waiganjo Wagacha This notes are for ICS320 Foundations of Learning and Adaptive Systems Institute of Computer Science University of Nairobi PO Box 30197, 00200 Nairobi.
More informationA Concise Representation of Association Rules using Minimal Predictive Rules
A Concise Representation of Association Rules using Minimal Predictive Rules Iyad Batal and Milos Hauskrecht Department of Computer Science University of Pittsburgh {iyad,milos}@cs.pitt.edu Abstract. Association
More informationData mining, 4 cu Lecture 5:
582364 Data mining, 4 cu Lecture 5: Evaluation of Association Patterns Spring 2010 Lecturer: Juho Rousu Teaching assistant: Taru Itäpelto Evaluation of Association Patterns Association rule algorithms
More informationSelecting a Right Interestingness Measure for Rare Association Rules
Selecting a Right Interestingness Measure for Rare Association Rules Akshat Surana R. Uday Kiran P. Krishna Reddy Center for Data Engineering International Institute of Information Technology-Hyderabad
More informationDescribing Data Table with Best Decision
Describing Data Table with Best Decision ANTS TORIM, REIN KUUSIK Department of Informatics Tallinn University of Technology Raja 15, 12618 Tallinn ESTONIA torim@staff.ttu.ee kuusik@cc.ttu.ee http://staff.ttu.ee/~torim
More informationDATA MINING LECTURE 3. Frequent Itemsets Association Rules
DATA MINING LECTURE 3 Frequent Itemsets Association Rules This is how it all started Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases.
More informationMining Free Itemsets under Constraints
Mining Free Itemsets under Constraints Jean-François Boulicaut Baptiste Jeudy Institut National des Sciences Appliquées de Lyon Laboratoire d Ingénierie des Systèmes d Information Bâtiment 501 F-69621
More informationCHAPTER-17. Decision Tree Induction
CHAPTER-17 Decision Tree Induction 17.1 Introduction 17.2 Attribute selection measure 17.3 Tree Pruning 17.4 Extracting Classification Rules from Decision Trees 17.5 Bayesian Classification 17.6 Bayes
More informationChapter 2 Quality Measures in Pattern Mining
Chapter 2 Quality Measures in Pattern Mining Abstract In this chapter different quality measures to evaluate the interest of the patterns discovered in the mining process are described. Patterns represent
More informationCSC 411 Lecture 3: Decision Trees
CSC 411 Lecture 3: Decision Trees Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 03-Decision Trees 1 / 33 Today Decision Trees Simple but powerful learning
More informationReductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York
Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval Sargur Srihari University at Buffalo The State University of New York 1 A Priori Algorithm for Association Rule Learning Association
More informationDiscovery of Functional and Approximate Functional Dependencies in Relational Databases
JOURNAL OF APPLIED MATHEMATICS AND DECISION SCIENCES, 7(1), 49 59 Copyright c 2003, Lawrence Erlbaum Associates, Inc. Discovery of Functional and Approximate Functional Dependencies in Relational Databases
More informationConstraint-Based Rule Mining in Large, Dense Databases
Appears in Proc of the 5th Int l Conf on Data Engineering, 88-97, 999 Constraint-Based Rule Mining in Large, Dense Databases Roberto J Bayardo Jr IBM Almaden Research Center bayardo@alummitedu Rakesh Agrawal
More informationCS6375: Machine Learning Gautam Kunapuli. Decision Trees
Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s
More informationInteresting Patterns. Jilles Vreeken. 15 May 2015
Interesting Patterns Jilles Vreeken 15 May 2015 Questions of the Day What is interestingness? what is a pattern? and how can we mine interesting patterns? What is a pattern? Data Pattern y = x - 1 What
More informationMeelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05
Meelis Kull meelis.kull@ut.ee Autumn 2017 1 Sample vs population Example task with red and black cards Statistical terminology Permutation test and hypergeometric test Histogram on a sample vs population
More informationData Analytics Beyond OLAP. Prof. Yanlei Diao
Data Analytics Beyond OLAP Prof. Yanlei Diao OPERATIONAL DBs DB 1 DB 2 DB 3 EXTRACT TRANSFORM LOAD (ETL) METADATA STORE DATA WAREHOUSE SUPPORTS OLAP DATA MINING INTERACTIVE DATA EXPLORATION Overview of
More informationStandardizing Interestingness Measures for Association Rules
Standardizing Interestingness Measures for Association Rules arxiv:138.374v1 [stat.ap] 16 Aug 13 Mateen Shaikh, Paul D. McNicholas, M. Luiza Antonie and T. Brendan Murphy Department of Mathematics & Statistics,
More informationA Global Constraint for Closed Frequent Pattern Mining
A Global Constraint for Closed Frequent Pattern Mining N. Lazaar 1, Y. Lebbah 2, S. Loudni 3, M. Maamar 1,2, V. Lemière 3, C. Bessiere 1, P. Boizumault 3 1 LIRMM, University of Montpellier, France 2 LITIO,
More informationStatistically Significant Dependencies for Spatial Co-location Pattern Mining and Classification Association Rule Discovery.
Statistically Significant Depencies for Spatial Co-location Pattern Mining and Classification Association Rule Discovery by Jundong Li A thesis submitted in partial fulfillment of the requirements for
More informationCube Lattices: a Framework for Multidimensional Data Mining
Cube Lattices: a Framework for Multidimensional Data Mining Alain Casali Rosine Cicchetti Lotfi Lakhal Abstract Constrained multidimensional patterns differ from the well-known frequent patterns from a
More informationCS 484 Data Mining. Association Rule Mining 2
CS 484 Data Mining Association Rule Mining 2 Review: Reducing Number of Candidates Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent Apriori principle holds due
More informationPattern-Based Decision Tree Construction
Pattern-Based Decision Tree Construction Dominique Gay, Nazha Selmaoui ERIM - University of New Caledonia BP R4 F-98851 Nouméa cedex, France {dominique.gay, nazha.selmaoui}@univ-nc.nc Jean-François Boulicaut
More informationData Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018
Data Mining CS57300 Purdue University Bruno Ribeiro February 8, 2018 Decision trees Why Trees? interpretable/intuitive, popular in medical applications because they mimic the way a doctor thinks model
More informationAssocia'on Rule Mining
Associa'on Rule Mining Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata August 4 and 7, 2014 1 Market Basket Analysis Scenario: customers shopping at a supermarket Transaction
More informationAssociation Analysis Part 2. FP Growth (Pei et al 2000)
Association Analysis art 2 Sanjay Ranka rofessor Computer and Information Science and Engineering University of Florida F Growth ei et al 2 Use a compressed representation of the database using an F-tree
More informationMaintaining Frequent Itemsets over High-Speed Data Streams
Maintaining Frequent Itemsets over High-Speed Data Streams James Cheng, Yiping Ke, and Wilfred Ng Department of Computer Science Hong Kong University of Science and Technology Clear Water Bay, Kowloon,
More informationAlgorithmic Methods of Data Mining, Fall 2005, Course overview 1. Course overview
Algorithmic Methods of Data Mining, Fall 2005, Course overview 1 Course overview lgorithmic Methods of Data Mining, Fall 2005, Course overview 1 T-61.5060 Algorithmic methods of data mining (3 cp) P T-61.5060
More informationApriori algorithm. Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK. Presentation Lauri Lahti
Apriori algorithm Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK Presentation 12.3.2008 Lauri Lahti Association rules Techniques for data mining and knowledge discovery in databases
More informationFrequent Itemset Mining
ì 1 Frequent Itemset Mining Nadjib LAZAAR LIRMM- UM COCONUT Team (PART I) IMAGINA 17/18 Webpage: http://www.lirmm.fr/~lazaar/teaching.html Email: lazaar@lirmm.fr 2 Data Mining ì Data Mining (DM) or Knowledge
More informationTransaction Databases, Frequent Itemsets, and Their Condensed Representations
Transaction Databases, Frequent Itemsets, and Their Condensed Representations Taneli Mielikäinen HIIT Basic Research Unit Department of Computer Science University of Helsinki, Finland Abstract. Mining
More informationQuantitative Association Rule Mining on Weighted Transactional Data
Quantitative Association Rule Mining on Weighted Transactional Data D. Sujatha and Naveen C. H. Abstract In this paper we have proposed an approach for mining quantitative association rules. The aim of
More informationMining Molecular Fragments: Finding Relevant Substructures of Molecules
Mining Molecular Fragments: Finding Relevant Substructures of Molecules Christian Borgelt, Michael R. Berthold Proc. IEEE International Conference on Data Mining, 2002. ICDM 2002. Lecturers: Carlo Cagli
More informationDATA MINING LECTURE 4. Frequent Itemsets, Association Rules Evaluation Alternative Algorithms
DATA MINING LECTURE 4 Frequent Itemsets, Association Rules Evaluation Alternative Algorithms RECAP Mining Frequent Itemsets Itemset A collection of one or more items Example: {Milk, Bread, Diaper} k-itemset
More informationDecision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro
Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Classification without Models Well, partially without a model } Today: Decision Trees 2015 Bruno Ribeiro 2 3 Why Trees? } interpretable/intuitive,
More informationDecision Trees. CSC411/2515: Machine Learning and Data Mining, Winter 2018 Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore
Decision Trees Claude Monet, The Mulberry Tree Slides from Pedro Domingos, CSC411/2515: Machine Learning and Data Mining, Winter 2018 Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore Michael Guerzhoy
More informationConnections between mining frequent itemsets and learning generative models
Connections between mining frequent itemsets and learning generative models Srivatsan Laxman Microsoft Research Labs India slaxman@microsoft.com Prasad Naldurg Microsoft Research Labs India prasadn@microsoft.com
More informationNaive Bayesian classifiers for multinomial features: a theoretical analysis
Naive Bayesian classifiers for multinomial features: a theoretical analysis Ewald van Dyk 1, Etienne Barnard 2 1,2 School of Electrical, Electronic and Computer Engineering, University of North-West, South
More informationIntroduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar
Data Mining Chapter 5 Association Analysis: Basic Concepts Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar 2/3/28 Introduction to Data Mining Association Rule Mining Given
More informationLayered critical values: a powerful direct-adjustment approach to discovering significant patterns
Mach Learn (2008) 71: 307 323 DOI 10.1007/s10994-008-5046-x TECHNICAL NOTE Layered critical values: a powerful direct-adjustment approach to discovering significant patterns Geoffrey I. Webb Received:
More informationRegression and Correlation Analysis of Different Interestingness Measures for Mining Association Rules
International Journal of Innovative Research in Computer Scien & Technology (IJIRCST) Regression and Correlation Analysis of Different Interestingness Measures for Mining Association Rules Mir Md Jahangir
More informationPrinciples of AI Planning
Principles of AI Planning 5. Planning as search: progression and regression Albert-Ludwigs-Universität Freiburg Bernhard Nebel and Robert Mattmüller October 30th, 2013 Introduction Classification Planning
More informationPrinciples of AI Planning
Principles of 5. Planning as search: progression and regression Malte Helmert and Bernhard Nebel Albert-Ludwigs-Universität Freiburg May 4th, 2010 Planning as (classical) search Introduction Classification
More informationInvestigating Measures of Association by Graphs and Tables of Critical Frequencies
Investigating Measures of Association by Graphs Investigating and Tables Measures of Critical of Association Frequencies by Graphs and Tables of Critical Frequencies Martin Ralbovský, Jan Rauch University
More informationCPDA Based Fuzzy Association Rules for Learning Achievement Mining
2009 International Conference on Machine Learning and Computing IPCSIT vol.3 (2011) (2011) IACSIT Press, Singapore CPDA Based Fuzzy Association Rules for Learning Achievement Mining Jr-Shian Chen 1, Hung-Lieh
More informationMining Literal Correlation Rules from Itemsets
Mining Literal Correlation Rules from Itemsets Alain Casali, Christian Ernst To cite this version: Alain Casali, Christian Ernst. Mining Literal Correlation Rules from Itemsets. IMMM 2011 : The First International
More informationComparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees
Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Tomasz Maszczyk and W lodzis law Duch Department of Informatics, Nicolaus Copernicus University Grudzi adzka 5, 87-100 Toruń, Poland
More informationFUZZY ASSOCIATION RULES: A TWO-SIDED APPROACH
FUZZY ASSOCIATION RULES: A TWO-SIDED APPROACH M. De Cock C. Cornelis E. E. Kerre Dept. of Applied Mathematics and Computer Science Ghent University, Krijgslaan 281 (S9), B-9000 Gent, Belgium phone: +32
More informationChapter 3 Deterministic planning
Chapter 3 Deterministic planning In this chapter we describe a number of algorithms for solving the historically most important and most basic type of planning problem. Two rather strong simplifying assumptions
More informationOn the robustness of association rules
On the robustness of association rules Philippe Lenca, Benoît Vaillant, Stéphane Lallich GET/ENST Bretagne CNRS UMR 2872 TAMCIC Technopôle de Brest Iroise CS 8388, 29238 Brest Cedex, France Email: philippe.lenca@enst-bretagne.fr
More informationMining Interesting Infrequent and Frequent Itemsets Based on Minimum Correlation Strength
Mining Interesting Infrequent and Frequent Itemsets Based on Minimum Correlation Strength Xiangjun Dong School of Information, Shandong Polytechnic University Jinan 250353, China dongxiangjun@gmail.com
More informationCS-C Data Science Chapter 8: Discrete methods for analyzing large binary datasets
CS-C3160 - Data Science Chapter 8: Discrete methods for analyzing large binary datasets Jaakko Hollmén, Department of Computer Science 30.10.2017-18.12.2017 1 Rest of the course In the first part of the
More informationData Mining and Knowledge Discovery. Petra Kralj Novak. 2011/11/29
Data Mining and Knowledge Discovery Petra Kralj Novak Petra.Kralj.Novak@ijs.si 2011/11/29 1 Practice plan 2011/11/08: Predictive data mining 1 Decision trees Evaluating classifiers 1: separate test set,
More informationMachine Learning: Pattern Mining
Machine Learning: Pattern Mining Information Systems and Machine Learning Lab (ISMLL) University of Hildesheim Wintersemester 2007 / 2008 Pattern Mining Overview Itemsets Task Naive Algorithm Apriori Algorithm
More informationCS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón
CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING Santiago Ontañón so367@drexel.edu Summary so far: Rational Agents Problem Solving Systematic Search: Uninformed Informed Local Search Adversarial Search
More informationFrequent Pattern Mining: Exercises
Frequent Pattern Mining: Exercises Christian Borgelt School of Computer Science tto-von-guericke-university of Magdeburg Universitätsplatz 2, 39106 Magdeburg, Germany christian@borgelt.net http://www.borgelt.net/
More informationConstraint-based Subspace Clustering
Constraint-based Subspace Clustering Elisa Fromont 1, Adriana Prado 2 and Céline Robardet 1 1 Université de Lyon, France 2 Universiteit Antwerpen, Belgium Thursday, April 30 Traditional Clustering Partitions
More informationPattern Structures 1
Pattern Structures 1 Pattern Structures Models describe whole or a large part of the data Pattern characterizes some local aspect of the data Pattern is a predicate that returns true for those objects
More informationLearning Decision Trees
Learning Decision Trees CS194-10 Fall 2011 Lecture 8 CS194-10 Fall 2011 Lecture 8 1 Outline Decision tree models Tree construction Tree pruning Continuous input features CS194-10 Fall 2011 Lecture 8 2
More informationFrom inductive inference to machine learning
From inductive inference to machine learning ADAPTED FROM AIMA SLIDES Russel&Norvig:Artificial Intelligence: a modern approach AIMA: Inductive inference AIMA: Inductive inference 1 Outline Bayesian inferences
More informationApproximating a Collection of Frequent Sets
Approximating a Collection of Frequent Sets Foto Afrati National Technical University of Athens Greece afrati@softlab.ece.ntua.gr Aristides Gionis HIIT Basic Research Unit Dept. of Computer Science University
More informationLevelwise Search and Borders of Theories in Knowledge Discovery
Data Mining and Knowledge Discovery 1, 241 258 (1997) c 1997 Kluwer Academic Publishers. Manufactured in The Netherlands. Levelwise Search and Borders of Theories in Knowledge Discovery HEIKKI MANNILA
More information