Theory of Dependence Values


ROSA MEO, Università degli Studi di Torino

A new model to evaluate dependencies in data mining problems is presented and discussed. The well-known concept of the association rule is replaced by the new definition of dependence value, which is a single real number uniquely associated with a given itemset. Knowledge of dependence values is sufficient to describe all the dependencies characterizing a given data mining problem. The dependence value of an itemset is the difference between the occurrence probability of the itemset and a corresponding maximum independence estimate. The latter can be determined as a function of the joint probabilities of the subsets of the itemset being considered, by maximizing a suitable entropy function. It is thus possible to separate, in an itemset of cardinality k, the dependence inherited from its subsets of cardinality k−1 and the specific inherent dependence of that itemset. The absolute value of the difference between the probability P(i) of the event i that indicates the presence of the itemset {a, b, ...} and its maximum independence estimate is constant for any combination of values of a, b, .... In addition, the Boolean function specifying the combinations of values for which the dependence is positive is a parity function, so the determination of such combinations is immediate. The model appears to be simple and powerful.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database applications: Data mining; Statistical databases; H.1.1 [Models and Principles]: Systems and Information Theory: Information theory; I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods

General Terms: Algorithms, Experimentation, Theory

Additional Key Words and Phrases: Association rules, dependence rules, entropy, variable independence
The author was previously (until November 1999) at Politecnico di Torino. Author's address: Department of Computer Science, Università degli Studi di Torino, corso Svizzera 185, Torino, 10149, Italy. ACM Transactions on Database Systems, Vol. 25, No. 3, September 2000.

1. INTRODUCTION

A well-known problem in data mining is the search for association rules, a powerful and intuitive conceptual tool to represent the phenomena that are recurrent in a data set. A number of interesting solutions to that problem have been proposed in the last five years, together with as many powerful

algorithms [Agrawal et al. 1993b; 1995; Agrawal and Srikant 1994; Savasere et al. 1995; Han and Fu 1995; Park et al. 1995; Toivonen 1996; Brin et al. 1997; Lin and Kedem 1998]. They are used in many application fields, such as the analysis of supermarket basket data, failures in telecommunication networks, medical test results, lexical features of texts, etc.

An association rule is an expression of the form X → Y, where X and Y are sets of items that are often found together in a given collection of data. For example, the expression {milk, coffee} → {bread, sugar} might mean that a customer purchasing milk and coffee is likely to also purchase bread and sugar. The validity of an association rule is based on two measures. The first measure, called support, is the percentage of transactions of the database containing both X and Y. The second one, called confidence, is the probability that, if X is purchased, Y is also purchased. In the case of the previous example, a support value of 2% and a confidence value of 15% would mean that 2% of all the customers buy milk, coffee, bread, and sugar, and that 15% of the customers that buy milk and coffee also buy bread and sugar.

Recently, Silverstein et al. [1998] have presented a critique of the concept of association rule and the related support-confidence framework. They have observed that the association rule model is well suited to the market basket problem, but that it does not address other data mining problems. In place of association rules and the support-confidence framework, Silverstein et al. propose a statistical approach based on the chi-squared measure and a new model of rules, called dependence rules. This work can be viewed as a continuation of that line of research, even if the model and the tools proposed here are rather different; in particular, the concept of dependence rules is replaced here by the concept of dependence values.
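The support and confidence measures just defined can be sketched in a few lines of Python. The baskets and item names below are hypothetical, chosen only to mirror the milk/coffee example; this is an illustration of the two measures, not any particular mining algorithm.

```python
# Hypothetical transactions illustrating support and confidence;
# the items and baskets are invented for the example.
baskets = [
    {"milk", "coffee", "bread", "sugar"},
    {"milk", "coffee"},
    {"tea"},
    {"bread", "sugar"},
    {"milk", "coffee", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets containing every item of `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(x, y, baskets):
    """P(Y in basket | X in basket) = support(X | Y) / support(X)."""
    return support(x | y, baskets) / support(x, baskets)

print(support({"milk", "coffee"}, baskets))                # 0.6
print(confidence({"milk", "coffee"}, {"bread"}, baskets))  # 0.666...
```

With these five baskets, {milk, coffee} appears in three of five transactions (support 60%), and two of those three also contain bread (confidence 2/3).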
This article is organized as follows. Section 2 contains a summary of the main results of earlier work, the emphasis being placed on the support-confidence framework, the critique of this model by Silverstein et al., and the concept of dependence rules in opposition to that of association rules. Section 3 contains the definition of dependence value and the other basic definitions of the model proposed here, as well as the theorems following from these definitions. These theorems suggest an easy and quick way to determine the dependence values, which is described in Section 5, whereas Section 4 discusses the use of the well-known concept of entropy as a tool to evaluate the relevance of a dependence rule. Finally, Section 6 draws the conclusions.

2. ASSOCIATION RULES AND DEPENDENCE RULES

As mentioned, this section contains a summary of earlier work on association rules. For ease of reference, the notation used by Silverstein et al. is adopted here.

2.1 Association Rules

Let I = {i_1, i_2, ..., i_k} be a set of k elements, called items. A basket of items is any subset of I. For example, in the market basket application, I = {milk, coffee, bread, sugar, tea, ...} contains all the items stocked by a supermarket, and a basket of items such as {milk, coffee, bread, sugar} is the set of purchases from one register transaction. As a second example, in the document basket application, I is the set of all the dictionary words and each basket is the set of all the words used in a given document.

An association rule X → Y, where X and Y are disjoint subsets of I, was defined by Agrawal et al. [1993b] as follows: X → Y holds if, and only if, X ∪ Y is a subset of at least s% (the support) of all the baskets, and, of all the baskets containing all the items of X, at least c% (the confidence) contain all the items of Y.

The concept of association rules and the related support-confidence framework are very powerful and useful, but they suffer from some limitations, especially when the absence of items is considered. An interesting example proposed by Silverstein et al. is the following. Consider the purchase of tea (t) and coffee (c) in a grocery store and assume the probabilities:

P(c, t) = 0.2
P(c, ¬t) = 0.7
P(¬c, t) = 0.05
P(¬c, ¬t) = 0.05

where ¬c and ¬t denote the events "coffee not purchased" and "tea not purchased," respectively. According to the preceding definitions, the potential rule tea → coffee has a support equal to 20% and a confidence equal to 80%, and therefore can be considered as a valid association rule. However, a deeper analysis shows that a customer buying tea is less likely to also buy coffee than a customer not buying tea (80% against more than 90%). We would write ¬tea → coffee; but, on the contrary, the strongest positive dependence is between the absence of coffee and the presence of tea.

2.2 Dependence Rules

Silverstein et al. propose a view of basket data in terms of Boolean indicator variables, as follows.
Let I_1, I_2, ..., I_k be a set of k Boolean variables called attributes. A set of baskets {b_1, b_2, ..., b_n} is a collection of n k-tuples from {TRUE, FALSE}^k which represent a collection of value assignments to the k

attributes. Assigning the value TRUE to an attribute variable I_j in a basket represents the presence of item i_j in the basket. The event a denotes A = TRUE or, equivalently, the presence of the corresponding item a in a basket. The complementary event ¬a denotes A = FALSE or, equivalently, the absence of item a from a basket. The probability that item a appears in a random basket is denoted P(a) = P(A = TRUE). Likewise, P(a, ¬b) = P(A = TRUE, B = FALSE) is the probability that item a is present and item b is absent.

Silverstein et al. have proposed the following definitions of independence and dependence of events and variables.

Definition 1. Two events x and y are independent if P(x ∧ y) = P(x) · P(y).

Definition 2. Two variables A and B are independent if P(A = v_a, B = v_b) = P(A = v_a) · P(B = v_b) for all possible values v_a, v_b ∈ {TRUE, FALSE}.

Definition 3. Events, or variables, that are not independent are dependent.

Definition 4. Let I be a set of attribute variables. We say that the set I is a dependence rule if I is dependent.

The following Theorem 1 is based on the preceding Definitions 1 through 4.

THEOREM 1. If a set of variables I is dependent, so is every superset of I.

Theorem 1 is important in the dependence rule model, because it makes it possible to restrict attention to the set of minimally dependent itemsets, where an itemset I is minimally dependent if it is dependent, but none of its subsets is dependent. Silverstein et al. have proposed using the chi-squared test for independence to identify dependence rules. The chi-squared statistic is upward-closed with respect to the lattice of all possible itemsets, as dependence rules are. In other words, if a set I of items is deemed dependent at a given significance level, then all supersets of I are also dependent at the same significance level and, therefore, they do not need to be examined for dependence or independence.
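As a sketch of how the chi-squared test detects such a dependence, the statistic for a 2x2 contingency table can be computed directly. The counts below assume a hypothetical total of n = 1000 baskets with the probabilities of the tea/coffee example of Section 2.1; the function is a plain Pearson chi-squared computation, not the authors' own code.

```python
def chi_squared_2x2(n11, n10, n01, n00):
    """Pearson chi-squared statistic for a 2x2 contingency table.
    n11 = count(row present, col present), n10 = count(row present, col absent), etc."""
    n = n11 + n10 + n01 + n00
    row1, row0 = n11 + n10, n01 + n00
    col1, col0 = n11 + n01, n10 + n00
    chi2 = 0.0
    for observed, r, c in ((n11, row1, col1), (n10, row1, col0),
                           (n01, row0, col1), (n00, row0, col0)):
        expected = r * c / n  # expected count under independence
        chi2 += (observed - expected) ** 2 / expected
    return chi2

# Rows: coffee present/absent; columns: tea present/absent (n = 1000).
stat = chi_squared_2x2(200, 700, 50, 50)
print(stat)  # ~37.04, far above the 3.84 critical value at significance 0.05
```

The statistic greatly exceeds the 1-degree-of-freedom critical value, so coffee and tea would be deemed dependent, consistent with the discussion above.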
3. DEPENDENCE VALUES

In this section the new model based on the concept of dependence values is presented and discussed. A theorem proved in this section provides the basic tools to evaluate the dependence rules of a given itemset. To simplify the presentation, we proceed from the simplest cases towards the most complex ones, in order of increasing cardinality of itemsets. In other words, we discuss dependence rules first for pairs of items, then for triplets of items, and finally for m-plets of arbitrary cardinality m.

3.1 Dependence Rules for Pairs of Items

Assume we know the occurrence probabilities of all the items: P(I_1 = TRUE), P(I_2 = TRUE), ..., P(I_k = TRUE). The evaluation of such probabilities is the first problem of data mining, but it is seldom considered because of its simplicity. Generally, the maximum likelihood estimate is adopted, according to which P(a) is assumed equal to O(a)/n, where O(a) is the number of baskets containing a and n is the total number of baskets. However, more complex computations based on Bayes's Theorem might also be used.

In the absence of specific determinations, if we know only P(A = TRUE) and P(B = TRUE), we might formulate the conjectures:

P(a, b) = P(a) · P(b)
P(a, ¬b) = P(a) · P(¬b)
P(¬a, b) = P(¬a) · P(b)
P(¬a, ¬b) = P(¬a) · P(¬b).

These conjectures are equivalent to the assumption that variables A and B are independent. Assume instead that the exact determination of P(a,b), evaluated as O(a,b)/n (where O(a,b) is the number of baskets containing both a and b), is different from the conjecture P(a,b) = P(a) · P(b):

P(a, b) ≠ P(a) · P(b).

It is easy to prove the following theorem.

THEOREM 2 (UNICITY OF THE VALUE FOR SECOND-ORDER PROBABILITIES). If P(A = TRUE) and P(B = TRUE) are known, the determination of the single value

Δ = P(a, b) − P(a) · P(b) = O(a, b)/n − (O(a)/n) · (O(b)/n)

is sufficient to evaluate all the second-order joint probabilities P(a, ¬b), P(¬a, b), P(¬a, ¬b).

PROOF. The proof is contained in the following simple relationships:

P(a, ¬b) = P(a) − P(a, b) = P(a) − P(a) · P(b) − Δ =

P(a) · (1 − P(b)) − Δ = P(a) · P(¬b) − Δ.

Analogously,

P(¬a, b) = P(b) − P(a, b) = P(b) − P(a) · P(b) − Δ = P(b) · (1 − P(a)) − Δ = P(¬a) · P(b) − Δ,

and

P(¬a, ¬b) = P(¬a) − P(¬a, b) = 1 − P(a) − P(¬a) · P(b) + Δ = P(¬a) · (1 − P(b)) + Δ = P(¬a) · P(¬b) + Δ. □

The fact that a single datum contains all the information pertaining to the joint probabilities of the pair {A, B} suggests the following definitions.

Definition 5 (Dependence Value of a Pair). The dependence value of the pair {A, B} is defined as the difference Δ = P(a, b) − P(a) · P(b).

Definition 6 (Dependence State of a Pair). If the absolute value of Δ = P(a, b) − P(a) · P(b) exceeds a given threshold th, A and B are said to be dependent. If Δ > th, the dependence is defined as positive; otherwise, it is defined as negative.

The following notations

D_2(A, B) > 0
D_2(A, B) < 0
D_2(A, B) = 0

are adopted to indicate a positive, a negative, or no dependence, respectively.

Fig. 1. The joint probabilities P(a*, b*) in the cells of the Karnaugh map of {A, B}.

Figure 1 shows that the difference between the joint probability P(a*, b*) (with a* = a or a* = ¬a, and b* = b or b* = ¬b) and the corresponding a-priori estimate P(a*) · P(b*) always has the same absolute value but a different sign in the various cells of the Karnaugh map of variables A and B. To represent this fact, we need another definition and a new theorem.

Definition 7 (Dependence Function of Two Variables). The Boolean function of variables A and B, whose minterms correspond to the values v_A, v_B for which

P(A = v_A, B = v_B) > P(A = v_A) · P(B = v_B)

is called the dependence function of variables A and B.

THEOREM 3 (PARITY OF TWO-VARIABLE DEPENDENCE FUNCTIONS). If D_2(A, B) < 0, the dependence function of variables A and B is

A ⊕ B = A · ¬B + ¬A · B,

which is the parity function with parity odd (Figure 2). If D_2(A, B) > 0, the dependence function of variables A and B is

¬(A ⊕ B) = A · B + ¬A · ¬B,

which is the parity function with parity even (Figure 3).

As a simple example, consider the case of the purchases of coffee (c) and tea (t), which was discussed in Silverstein et al. [1998] to show the weakness of the traditional support-confidence framework (Section 2.1). If

P(c, t) = 0.2
P(c, ¬t) = 0.7
P(¬c, t) = 0.05
P(¬c, ¬t) = 0.05

then

P(C = TRUE) = 0.9
P(T = TRUE) = 0.25.

Fig. 2. The dependence function of variables A and B if D_2(A, B) < 0.
Fig. 3. The dependence function of variables A and B if D_2(A, B) > 0.

Therefore P(c) · P(t) = 0.225 and

P(c, t) < P(c) · P(t),

which shows that the dependence is negative (D_2(C, T) < 0). One might wonder whether the usual notation X → Y adopted in several well-known papers on data mining still makes sense, and how to indicate a negative dependence such as D_2(C, T) < 0 (¬C → T, or C → ¬T, or ¬T → C, or T → ¬C?). The answer is simple: P(c, t) < P(c) · P(t) or, simply, D_2(C, T) < 0 contains all the information on the second-order dependencies. However, one might argue that Δ is more significant for the events having a lower probability. In the case of coffee and tea, P(¬c, t) = 0.05 is the lowest probability in the cells of the dependence function; therefore, it is not completely unreasonable to write ¬C → T.

3.2 Dependence Rules for Triplets of Items

This section is devoted to the generalization of the definitions and theorems presented in Section 3.1 to the case of triplets of items. As we show, such generalization implies some new problems. Consider the case of a triplet of Boolean variables A, B, and C, and assume we know the first- and second-order joint probabilities, from which conditional probabilities such as P(a | b) = P(a, b)/P(b) follow. We are interested in determining the third-order joint probabilities of triplets, such as P(a, b, c), P(a, b, ¬c), and so on, from which the third-order conditional probabilities such as P(a | b, c) = P(A = TRUE | B = TRUE, C = TRUE) follow directly. The following theorem shows that knowledge of a single third-order probability is sufficient to determine all the third-order probabilities.

THEOREM 4 (UNICITY OF THE VALUE FOR THIRD-ORDER PROBABILITIES). All the third-order joint probabilities can be calculated as functions of the first- and second-order joint probabilities and a single datum, such as one third-order joint probability.

PROOF. Assume, for example, we know P(a, b, c). The other joint probabilities can be determined as follows.

P(a, b, ¬c) = P(a, b) − P(a, b, c)
P(a, ¬b, c) = P(a, c) − P(a, b, c)
P(¬a, b, c) = P(b, c) − P(a, b, c)
P(a, ¬b, ¬c) = P(a, ¬b) − P(a, ¬b, c)
P(¬a, b, ¬c) = P(¬a, b) − P(¬a, b, c)
P(¬a, ¬b, c) = P(¬a, c) − P(¬a, b, c)
P(¬a, ¬b, ¬c) = P(¬a, ¬b) − P(¬a, ¬b, c). □

Theorem 4 may be viewed as an extension of Theorem 2 on the unicity of the value for second-order probabilities shown in Section 3.1. However, Theorem 2 makes reference to the differences between the determined P(a*, b*) and the estimated P(a*) · P(b*), which correspond to the conjecture of independence of a and b. In the case of triplets, the condition of independence is more difficult to identify. Our proposal is contained in the following considerations.

The relationships written in the proof of Theorem 4 can also be formulated as:

P(a, b, c) = x
P(a, b, ¬c) = P(a, b) − x
P(a, ¬b, c) = P(a, c) − x
P(¬a, b, c) = P(b, c) − x
P(a, ¬b, ¬c) = P(a) − P(a, c) − P(a, b) + x
P(¬a, b, ¬c) = P(b) − P(a, b) − P(b, c) + x
P(¬a, ¬b, c) = P(c) − P(a, c) − P(b, c) + x
P(¬a, ¬b, ¬c) = 1 − P(a) − P(b) − P(c) + P(a, b) + P(a, c) + P(b, c) − x.

They express the values of all the third-order joint probabilities as functions of the known second-order probabilities P(a, b), P(a, c), ..., and the unknown third-order probability x = P(a, b, c). Now consider the entropy

E(x) = −Σ P(a*, b*, c*) · log P(a*, b*, c*) = −[x · log x + (P(a, b) − x) · log(P(a, b) − x) + ...].

This function of the unknown x is the average amount of information needed to know a, b, and c. The maximum value of E(x) is reached when a, b, and c are at the maximum level of independence compatible with the dependencies imposed by the second-order joint probabilities. This consideration explains the following Definition 8.

Definition 8 (Maximum Independence Estimate for Third-Order Probabilities). If the first- and second-order joint probabilities are known but no information is available on the third-order probabilities, the conjecture x on the value of P(a, b, c) maximizing the joint entropy of A, B, C,

E(x) = −Σ P(a*, b*, c*) · log P(a*, b*, c*)

(where the sum is to be extended to all the combinations of values of a, b, and c), is defined as the maximum independence estimate. Such maximum independence estimate is denoted with the symbol P(a, b, c)_MI. Analogously, for any combination a*, b*, c* of values of a, b, c, we shall denote by P(a*, b*, c*)_MI the value of P(a*, b*, c*) for which E(x) is maximum.

Notice that, by virtue of Theorem 4, for any combination of values a*, b*, c*, the estimate P(a*, b*, c*)_MI can be computed in terms of the second-order joint probabilities and P(a, b, c)_MI by applying the relationships

P(¬a, b, c)_MI = P(b, c) − P(a, b, c)_MI
P(a, ¬b, c)_MI = P(a, c) − P(a, b, c)_MI
P(a, b, ¬c)_MI = P(a, b) − P(a, b, c)_MI
P(a, ¬b, ¬c)_MI = P(a, ¬b) − P(a, ¬b, c)_MI
P(¬a, b, ¬c)_MI = P(¬a, b) − P(¬a, b, c)_MI
P(¬a, ¬b, c)_MI = P(¬a, c) − P(¬a, b, c)_MI
P(¬a, ¬b, ¬c)_MI = P(¬a, ¬b) − P(¬a, ¬b, c)_MI.

The meaning of Definition 8 is rather important for the model presented in this article. If D_2(A, B), D_2(A, C), D_2(B, C) are > 0 or < 0, then A, B, C are not independent, but they could own only the dependence inherited from the second-order dependencies, or their dependence might be stronger. In the former case, P(a*, b*, c*) is equal to P(a*, b*, c*)_MI and there is no real third-order dependence.
In the latter case, there is evidence of a third-order dependence whose value and sign depend on the difference between P(a*, b*, c*) and P(a*, b*, c*)_MI, as shown in the following analysis.
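The maximum independence estimate of Definition 8 can be approximated numerically. The sketch below is an assumption of this presentation (it is not the fast method the paper describes in Section 5): since E(x) is concave in x, its derivative decreases monotonically, so a simple bisection on dE/dx over the feasible interval finds the maximizing x.

```python
import math

def third_order_mi_estimate(pa, pb, pc, pab, pac, pbc, tol=1e-12):
    """P(a,b,c)_MI (Definition 8): the x maximizing the joint entropy E(x),
    found by bisection on dE/dx, which decreases since E is concave."""
    # The eight third-order joint probabilities as functions of x = P(a,b,c).
    def cells(x):
        return [x,
                pab - x, pac - x, pbc - x,
                pa - pab - pac + x, pb - pab - pbc + x, pc - pac - pbc + x,
                1 - pa - pb - pc + pab + pac + pbc - x]

    # dE/dx = -sum(s_i * (log cell_i + 1)), where s_i is the coefficient
    # (+1 or -1) with which x enters cell i.
    signs = [1, -1, -1, -1, 1, 1, 1, -1]
    def dE(x):
        return -sum(s * (math.log(c) + 1) for s, c in zip(signs, cells(x)))

    # Feasible interval: every cell probability must stay positive.
    lo = max(0.0, pab + pac - pa, pab + pbc - pb, pac + pbc - pc) + tol
    hi = min(pab, pac, pbc, 1 - pa - pb - pc + pab + pac + pbc) - tol
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if dE(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Sanity check: with pairwise-independent inputs the estimate
# reduces to P(a) * P(b) * P(c).
print(third_order_mi_estimate(0.5, 0.5, 0.5, 0.25, 0.25, 0.25))  # ~0.125
```

The sanity check mirrors the remark below that, for independent variables, maximum independence coincides with absolute independence.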

Notice that, in the case of pairs of items, Definition 8 of maximum independence coincides with the better known definitions of independence cited in Section 3.1. Indeed, in this case, as shown in Figure 1, the joint entropy of A and B is

E = −Σ P(a*, b*) · log P(a*, b*)
  = −[x · log x + (P(a) − x) · log(P(a) − x) + (P(b) − x) · log(P(b) − x) + (1 − P(a) − P(b) + x) · log(1 − P(a) − P(b) + x)],

where x = P(a, b). It is easy to prove that the function E has a maximum for

x = P(a, b)_MI = P(a) · P(b).

By applying the same algorithm, it is easy to prove the analogous results:

y = P(a, ¬b)_MI = P(a) · P(¬b)
z = P(¬a, b)_MI = P(¬a) · P(b)
w = P(¬a, ¬b)_MI = P(¬a) · P(¬b).

Unfortunately, in the case of triplets and k-plets the determination of the maximum independence estimates is not so simple. However, as shown, it is not necessary to know all the estimates P(i*_1, i*_2, ..., i*_k)_MI, as one of them is sufficient to determine all the others. In addition, the numerical evaluation of this estimate can be performed very quickly by applying the method described in Section 5.

The definition of the maximum independence estimate is applied in the following theorem, which can be viewed as a specification of Theorem 4 on the unicity of the value for third-order probabilities and as the natural extension of Theorem 2 proved in Section 3.1.

THEOREM 5. If the first- and second-order joint probabilities and the third-order maximum independence estimate are known, a single number defined as the difference

Δ = P(a, b, c) − P(a, b, c)_MI

is sufficient to specify all the third-order joint probabilities.

PROOF. Theorem 5 is a direct consequence of Theorem 4 on the unicity of the value. Indeed, from the knowledge of the first- and second-order joint probabilities we can obtain P(a, b, c)_MI, and from this P(a, b, c) = P(a, b, c)_MI + Δ. But, according to Theorem 4, the knowledge of P(a, b, c) is sufficient to determine all the third-order joint probabilities. □

By virtue of Theorem 5, we can state the following definitions, which are an extension of Definitions 5 and 6.

Definition 9 (Dependence Value of a Triplet). The dependence value of the triplet {A, B, C} is defined as the difference Δ = P(a, b, c) − P(a, b, c)_MI.

Definition 10 (Dependence State of a Triplet). If the absolute value of the dependence value of {A, B, C}, Δ = P(a, b, c) − P(a, b, c)_MI, exceeds a given threshold th, then A, B, and C are defined as connected by a third-order dependence. If Δ > th, the dependence is defined as positive; otherwise, it is defined as negative.

The following notations

D_3(A, B, C) > 0
D_3(A, B, C) < 0
D_3(A, B, C) = 0

are used to indicate the existence of a third-order dependence and its sign. Notice that in the model proposed by Silverstein et al. the existence of one or more second-order dependencies implies the existence of the third-order dependence, whereas in our model D_2(A, B), D_2(A, C), D_2(B, C), and D_3(A, B, C) are independent, in the sense that any combination of their values is possible. For example, even if all three second-order dependencies are positive, D_3(A, B, C) might be zero or negative. In Section 3.3, an example about the purchase of a triplet of items and the differences with respect to the other models are discussed.

The following Definition 11 on the dependence function of three variables and Theorem 6 extend the statements of Definition 7 on the dependence function for pairs and of Theorem 3 on the parity function to third-order dependencies.

Definition 11 (Dependence Function of Three Variables). The Boolean function of variables A, B, and C, whose minterms correspond to the values v_A, v_B, v_C for which

P(A = v_A, B = v_B, C = v_C) > P(A = v_A, B = v_B, C = v_C)_MI

is called the dependence function of variables A, B, and C.

THEOREM 6 (PARITY OF DEPENDENCE FUNCTIONS OF THREE VARIABLES). If D_3(A, B, C) > 0, the dependence function of variables A, B, and C is

A ⊕ B ⊕ C = A · B · C + A · ¬B · ¬C + ¬A · B · ¬C + ¬A · ¬B · C,

that is, the parity function with parity even (Figure 4). If D_3(A, B, C) < 0, the dependence function is

¬(A ⊕ B ⊕ C) = ¬A · ¬B · ¬C + ¬A · B · C + A · ¬B · C + A · B · ¬C,

that is, the parity function with parity odd and the complementary function of the preceding one (Figure 5).

Fig. 4. The dependence function when D_3(A, B, C) > 0.

PROOF. By definition,

P(a, b, ¬c) = P(a, b) − P(a, b, c) = P(a, b) − P(a, b, c)_MI − Δ = P(a, b, ¬c)_MI − Δ.

The values presented in Figure 6 follow from analogous computations. From these, it is immediate to derive the two maps of Figures 4 and 5, when D_3(A, B, C) > 0 or D_3(A, B, C) < 0, respectively. □

Justification of the Maximum Independence Definition. The idea of maximum independence introduced in this article is not intuitively obvious and needs some further justification. First consider the simple case of two variables A and B. In this case, as shown above, the definition of maximum independence coincides with the well-known definition of absolute independence, according to which A and B are independent if, and only if,

P(A = v_A, B = v_B) = P(A = v_A) · P(B = v_B)

for any combination of values of A and B. It is well known that the joint entropy of A and B is

E(A, B) = E(A) + E(B|A) = E(B) + E(A|B),

where E(B|A) and E(A|B) are the equivocation of B with respect to A and the equivocation of A with respect to B, respectively. Therefore, the maximum value of E(A, B) is reached when E(B|A) (or E(A|B)) is maximum. When A and B are independent, the amount of information needed to know B, if A is known, or to know A, if B is known, is maximum. Notice that in this case E(A|B) = E(A) and E(B|A) = E(B). These equalities will not hold in the case of three variables.

Now consider the case of three variables A, B, and C. In general, if the probabilities of A, B, and C and the second-order joint probabilities P(A, B), P(B, C), and P(A, C) have been assigned, there is no assignment of the probability P(A, B, C) for which A, B, and C are independent, that is, for which

P(A = v_A, B = v_B, C = v_C) = P(A = v_A) · P(B = v_B) · P(C = v_C)

for any combination of values of v_A, v_B, and v_C.
However, it makes sense to search for the value of P(A, B, C) for which the joint entropy E(A, B, C) is maximum, and to define that condition as the maximum level of independence compatible with the dependencies imposed by the second-order joint probabilities. Indeed,

Fig. 5. The dependence function when D_3(A, B, C) < 0.
Fig. 6. The dependence function for the three variables A, B, and C.

E(A, B, C) = E(A, B) + E(C|A, B) = E(A, C) + E(B|A, C) = E(B, C) + E(A|B, C).

Therefore, since E(A, B), E(A, C), and E(B, C) depend only on the values of the second-order probabilities, E(A, B, C) reaches its maximum for that assignment of P(A, B, C) for which E(C|A, B), E(B|A, C), and E(A|B, C) also reach their maximum values. In other words, the maximum independence level corresponds to the condition in which the maximum amount of information is needed to know the value of a variable, the other two being known. However, in general, since A, B, and C are not independent,

E(A|B, C) ≤ E(A|B) ≤ E(A) and E(A|B, C) ≤ E(A|C) ≤ E(A),

and this is different from the case of pairs of variables, for which the concepts of maximum independence and absolute independence coincide.

3.3 The Lattice of Dependencies

Since the knowledge of the dependence value of an itemset of cardinality k, together with the values of the joint probabilities of all its subsets of cardinality k−1, is sufficient to know the probabilities of all the combinations of its values, the lattice of the itemsets can be adopted to describe the whole system of dependencies of a given database. Of course, in such a lattice every node should be labeled with its associated dependence value. Besides, the nodes at the top of the lattice, representing the itemsets of cardinality 1, will be labeled with the values of the differences between the probability estimates, P(a) = O(a)/n, P(b) = O(b)/n, ..., and the corresponding starting estimates (typically, and in the absence of other estimates, equal to 0.5).

Fig. 7. The lattice relative to the purchases of coffee (c), tea (t), and doughnuts (d): nodes c, t, d, ct, cd, td, ctd.

By way of example, Figure 7 represents the dependence lattice relative to the sample reported by Silverstein et al. in their paper. The following are the data of the purchases of coffee (c), tea (t), and doughnuts (d) and their combinations proposed by those authors:

P(c, t, d) = 0.08
P(c, t, ¬d) = 0.01
P(c, ¬t, d) = 0.4
P(c, ¬t, ¬d) = 0.02
P(¬c, t, d) = 0.1
P(¬c, t, ¬d) = 0.02
P(¬c, ¬t, d) = 0.35
P(¬c, ¬t, ¬d) = 0.02.

The dependence values of the nodes of the lattice have been calculated as follows:

Δ(c) = P(c) − P(c)_MI = O(c)/n − 0.5
Δ(t) = P(t) − P(t)_MI = O(t)/n − 0.5
Δ(d) = P(d) − P(d)_MI = O(d)/n − 0.5
Δ(c, t) = P(c, t) − P(c, t)_MI = O(c, t)/n − P(c) · P(t)
Δ(c, d) = P(c, d) − P(c, d)_MI = O(c, d)/n − P(c) · P(d)
Δ(t, d) = P(t, d) − P(t, d)_MI = O(t, d)/n − P(t) · P(d)
Δ(c, t, d) = P(c, t, d) − P(c, t, d)_MI = O(c, t, d)/n − P(c, t, d)_MI.
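The node labels of the lattice can be reproduced numerically. The sketch below assumes a particular assignment of the presence/absence combinations to the eight probabilities listed above (that assignment is our illustration, so the exact numbers should be read as indicative); it recomputes the pairwise dependence values and the entropy-maximizing estimate P(c,t,d)_MI by bisection, as in Definition 8.

```python
import math

# Assumed joint distribution for coffee (C), tea (T), doughnuts (D);
# p[(vc, vt, vd)] with 1 = present, 0 = absent. Illustrative values.
p = {(1, 1, 1): 0.08, (1, 1, 0): 0.01, (1, 0, 1): 0.40, (1, 0, 0): 0.02,
     (0, 1, 1): 0.10, (0, 1, 0): 0.02, (0, 0, 1): 0.35, (0, 0, 0): 0.02}

def marg(positions):
    """P(all variables at the given positions are present)."""
    return sum(v for k, v in p.items() if all(k[i] == 1 for i in positions))

pc, pt, pd = marg([0]), marg([1]), marg([2])
pct, pcd, ptd = marg([0, 1]), marg([0, 2]), marg([1, 2])

# Second-order dependence values (Definition 5).
delta_ct = pct - pc * pt
delta_cd = pcd - pc * pd
delta_td = ptd - pt * pd

def mi_estimate(pa, pb, pc_, pab, pac, pbc, tol=1e-12):
    """P(a,b,c)_MI via bisection on the derivative of the joint entropy."""
    cells = lambda x: [x, pab - x, pac - x, pbc - x,
                       pa - pab - pac + x, pb - pab - pbc + x,
                       pc_ - pac - pbc + x,
                       1 - pa - pb - pc_ + pab + pac + pbc - x]
    signs = [1, -1, -1, -1, 1, 1, 1, -1]
    lo = max(0.0, pab + pac - pa, pab + pbc - pb, pac + pbc - pc_) + tol
    hi = min(pab, pac, pbc, 1 - pa - pb - pc_ + pab + pac + pbc) - tol
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if -sum(s * (math.log(c) + 1) for s, c in zip(signs, cells(mid))) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Third-order dependence value (Definition 9).
delta_ctd = p[(1, 1, 1)] - mi_estimate(pc, pt, pd, pct, pcd, ptd)
print(delta_ct, delta_cd, delta_td, delta_ctd)
```

With this assignment the marginals come out as P(c) = 0.51, P(t) = 0.21, P(d) = 0.93, and the feasible interval for P(c,t,d)_MI is (0.06, 0.09), within which the bisection converges.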

Fig. 8. The sign of the dependence function for the example of the purchases of coffee (variable C), tea (variable T), and doughnuts (variable D).

P(c, t, d)_MI has been computed by maximizing the entropy E(x) with x = P(c, t, d), as suggested in Definition 8 on the maximum independence estimate. Notice that from the value of Δ(c, t, d) and from Definition 10 on the state of dependencies it follows, for example, that the dependence of the itemset {c, t, d} is positive, whereas, by adopting the model proposed by Silverstein et al., the same dependence, evaluated as

P(a, b, c) − P(a) · P(b) · P(c),

would be negative. This is due to the fact that, in the Silverstein et al. model, the dependencies which the subset {c, t, d} has inherited from the subsets {c, t}, {c, d}, and {t, d} are not distinguished from its specific inherent dependence. The complete dependence table showing the dependence function signs for all the values of c, t, d is shown in Figure 8.

The dependence lattice can also be viewed as a useful tool to display the results of a data mining investigation on a given database. Of course, it will be convenient to display only the sublattice of the nodes having sufficient support and positive or negative dependencies that differ from zero in a significant way. Often, the dependence value itself is not necessary, it being sufficient to introduce the indication (+ or −) of the dependence state in the lattice produced.

3.4 Dependence Rules for k-plets of Items of Arbitrary Cardinality

The case of triplets discussed in Section 3.2 is absolutely general. However, for the sake of completeness, the definitions and theorems presented in Section 3.2 are extended here to the more general case of k-plets of arbitrary cardinality. For brevity, the proofs of the theorems are omitted, with the exception of Theorem 7, which needs a specific proof.
Consider the case of a k-plet of Boolean variables I_1, I_2, ..., I_k, and assume we know all the joint probabilities up to the order (k−1): P(i_1), P(i_2), ..., P(i_{k−1}), P(i_k); P(i_1, i_2), P(i_1, i_3), ..., P(i_{k−1}, i_k); ...; P(i_1, i_2, ..., i_{k−1}), ..., P(i_2, i_3, ..., i_{k−1}, i_k). We want to determine the kth-order joint probabilities, such as P(i_1, i_2, ..., i_{k−1}, i_k), P(i_1, i_2, ..., i_{k−1}, ¬i_k), and so on. The

following theorem shows that knowledge of a single kth-order joint probability is sufficient to determine all the kth-order probabilities.

THEOREM 7 (UNICITY OF THE VALUE). All the kth-order joint probabilities can be calculated as functions of the joint probabilities of the orders less than k and a single kth-order joint probability.

PROOF. Assume, for example, we know P(i_1, i_2, ..., i_{k−1}, i_k). First, we determine

P(¬i_1, i_2, ..., i_{k−1}, i_k) = P(i_2, ..., i_{k−1}, i_k) − P(i_1, i_2, ..., i_{k−1}, i_k).

Analogously, we determine all the other joint probabilities related to elementary conditions in which a single literal is complemented:

P(i_1, ¬i_2, i_3, ..., i_{k−1}, i_k) = P(i_1, i_3, ..., i_{k−1}, i_k) − P(i_1, i_2, i_3, ..., i_{k−1}, i_k),

and so on. Then, we compute all the joint probabilities referring to elementary conditions in which two literals appear complemented:

P(¬i_1, ¬i_2, i_3, ..., i_{k−1}, i_k) = P(¬i_1, i_3, ..., i_{k−1}, i_k) − P(¬i_1, i_2, i_3, ..., i_{k−1}, i_k),

and so on. In general, in order to determine all the joint probabilities related to elementary conditions containing m complemented literals, we apply the following relationship, in which a_1 < a_2 < ... < a_m, b_1 < b_2 < ... < b_{k−m}, 1 ≤ a_i ≤ k and 1 ≤ b_j ≤ k, with 1 ≤ i ≤ m and 1 ≤ j ≤ k−m:

P(¬i_{a_1}, ¬i_{a_2}, ..., ¬i_{a_m}, i_{b_1}, ..., i_{b_{k−m}}) = P(¬i_{a_1}, ¬i_{a_3}, ..., ¬i_{a_m}, i_{b_1}, ..., i_{b_{k−m}}) − P(¬i_{a_1}, i_{a_2}, ¬i_{a_3}, ..., ¬i_{a_m}, i_{b_1}, ..., i_{b_{k−m}}),

where at most (m−1) complemented literals appear on the right-hand side. □
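The recursion in the proof of Theorem 7 unfolds into an inclusion-exclusion sum over the complemented literals, which can be sketched as follows. The representation of events as dicts and frozensets is our implementation choice, not the paper's notation.

```python
from itertools import combinations

def assignment_probability(pattern, presence):
    """Probability of a full value assignment, given only 'presence'
    probabilities P(all items of S present) for every subset S.
    pattern: dict var -> True/False; presence: dict frozenset -> float.
    Repeatedly applying P(..., not i, ...) = P(... without i ...) - P(..., i, ...)
    yields an inclusion-exclusion sum over the complemented variables."""
    present = frozenset(v for v, val in pattern.items() if val)
    absent = [v for v, val in pattern.items() if not val]
    total = 0.0
    for r in range(len(absent) + 1):
        for t in combinations(absent, r):  # subsets T of complemented vars
            total += (-1) ** r * presence[present | frozenset(t)]
    return total

# Pair example from Section 2.1: P(C) = 0.9, P(T) = 0.25, P(C,T) = 0.2.
presence = {frozenset(): 1.0, frozenset({"C"}): 0.9,
            frozenset({"T"}): 0.25, frozenset({"C", "T"}): 0.2}
print(assignment_probability({"C": True, "T": False}, presence))   # ~0.7
print(assignment_probability({"C": False, "T": False}, presence))  # ~0.05
```

Note that the two printed values reproduce P(c, ¬t) = 0.7 and P(¬c, ¬t) = 0.05 of the coffee/tea example.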

Definition 12 (Maximum Independence Estimate). If the joint probabilities up to order k−1 are known but no information is available on the joint probabilities of order k, then the conjecture on P(i*_1, i*_2, ..., i*_k) maximizing the joint entropy of I_1, I_2, ..., I_k,

E = −Σ P(i*_1, i*_2, ..., i*_k) · log P(i*_1, i*_2, ..., i*_k),

is considered as the maximum independence estimate. For any i*_1, i*_2, ..., i*_k, the maximum independence estimate is indicated with the symbol P(i*_1, i*_2, ..., i*_k)_MI.

Definition 13 (Dependence Value). The difference Δ = P(i_1, i_2, ..., i_k) − P(i_1, i_2, ..., i_k)_MI is defined as the dependence value of the itemset {i_1, i_2, ..., i_k}.

THEOREM 8 (UNICITY OF THE VALUE). If the joint probabilities up to the order k−1 are known, the knowledge of the dependence value Δ = P(i_1, i_2, ..., i_k) − P(i_1, i_2, ..., i_k)_MI is sufficient to describe all the kth-order joint probabilities.

Definition 14 (Dependence State). If the absolute value of the dependence value of I_1, I_2, ..., I_k exceeds a given threshold th, then I_1, I_2, ..., I_k are defined as connected by a dependence of order k. If Δ > th, the dependence is defined as positive; otherwise, it is defined as negative.

The following notations

D_k(I_1, I_2, ..., I_k) > 0
D_k(I_1, I_2, ..., I_k) < 0
D_k(I_1, I_2, ..., I_k) = 0

are used to indicate the existence of a dependence of order k and its sign.

Definition 15 (Dependence Function). The Boolean function of variables I_1, I_2, ..., I_k, whose minterms correspond to the values v_1, v_2, ..., v_k for which

P(I_1 = v_1, I_2 = v_2, ..., I_k = v_k) > P(I_1 = v_1, I_2 = v_2, ..., I_k = v_k)_MI,

is called the dependence function of variables I_1, I_2, ..., I_k.
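Definitions 14 and 15 translate almost directly into code. The sketch below (the function names are ours) classifies the dependence state from the dependence value and enumerates the minterms of the dependence function, using the parity rule that Theorem 9 below makes precise: the cell difference is +Δ where the number of complemented literals is even and −Δ where it is odd.

```python
from itertools import product

def dependence_state(delta, th):
    """Return +1, -1, or 0 for D_k > 0, D_k < 0, D_k = 0 (Definition 14)."""
    if abs(delta) <= th:
        return 0
    return 1 if delta > 0 else -1

def dependence_function_minterms(k, delta):
    """Value assignments where P exceeds the maximum independence
    estimate (Definition 15): those with an even number of complemented
    (FALSE) literals when delta > 0, an odd number when delta < 0."""
    want_even = delta > 0
    return [values for values in product([True, False], repeat=k)
            if (values.count(False) % 2 == 0) == want_even]

print(dependence_function_minterms(2, 0.02))   # [(True, True), (False, False)]
print(dependence_function_minterms(2, -0.02))  # [(True, False), (False, True)]
```

For k = 2 this reproduces the two Karnaugh maps of Theorem 3: the even-parity function A·B + ¬A·¬B for a positive dependence, and A·¬B + ¬A·B for a negative one.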

THEOREM 9 (PARITY OF DEPENDENCE FUNCTIONS). If D_k(I_1, I_2, ..., I_k) > 0, the dependence function of variables I_1, I_2, ..., I_k is \overline{I_1 \oplus I_2 \oplus ... \oplus I_k}, that is, the parity function with even parity. If D_k(I_1, I_2, ..., I_k) < 0, the dependence function of variables I_1, I_2, ..., I_k is I_1 \oplus I_2 \oplus ... \oplus I_k, that is, the parity function with odd parity. In both cases, and for all values of I_1, I_2, ..., I_k, the difference P(i_1, i_2, ..., i_k) - P(i_1, i_2, ..., i_k)_{MI} has an absolute value equal to |\Delta|.

4. ENTROPY AND DEPENDENCIES

A less intuitive but in some respects more effective approach to determining dependencies can be based on the concept of entropy. In this section only a summary of a possible entropy-based theory of dependencies is presented, the task of developing such a theory following the scheme of Section 3 being left to the reader.

First consider the case of the pairs of items. Assume P(a), P(\bar{a}), P(b), P(\bar{b}) are known. The entropy of A,

E(A) = -P(a) \log P(a) - P(\bar{a}) \log P(\bar{a}),

is the measure of the average information content of the events a (A = TRUE) and \bar{a} (A = FALSE). An analogous meaning can be attributed to

E(B) = -P(b) \log P(b) - P(\bar{b}) \log P(\bar{b}).

Now consider the mutual information

I(A;B) = E(A) - E(A|B), where E(A|B) = - \sum_{a^*, b^*} P(a^*, b^*) \log P(a^*|b^*).

I(A;B) = E(A) - E(A|B) = E(B) - E(B|A) is a measure of the average information content carried by b^* on the value of A, and vice versa; therefore, it can be assumed as an indication of the dependence of A and B. Unfortunately, mutual information I(A;B) is always nonnegative, so it is necessary to verify whether P(a,b) > P(a) P(b) in order to determine the sign of the dependence. Therefore, we propose the following definition:
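Theorem 9 can be checked directly for pairs: writing P(a,b) = P(a)P(b) + Δ and deriving the remaining cells by marginalization shows a deviation of the same magnitude |Δ| in every cell, positive exactly on the even-parity cells. A small Python sketch (hypothetical names):

```python
def pair_cells(p_a, p_b, delta):
    """All four joint probabilities of a pair once P(a,b) = P(a)P(b) + delta.
    The deviation from the independence product has magnitude |delta| in every
    cell, with sign + on the even-parity cells (1,1) and (0,0) and sign - on
    the odd-parity cells (1,0) and (0,1), as Theorem 9 states for k = 2."""
    p_ab = p_a * p_b + delta
    return {
        (1, 1): p_ab,                       # a and b both present
        (1, 0): p_a - p_ab,                 # a present, b absent
        (0, 1): p_b - p_ab,                 # a absent, b present
        (0, 0): 1.0 - p_a - p_b + p_ab,     # both absent
    }
```

With P(a) = 0.5, P(b) = 0.4, and Δ = 0.1, the deviations from independence are +0.1 on cells (1,1) and (0,0) and -0.1 on cells (1,0) and (0,1).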

Definition 16 (Entropy-Based Second-Order Dependence). If I(A;B) exceeds a given threshold, we state that D_2(A, B) > 0 or D_2(A, B) < 0 according to whether P(a,b) > P(a) P(b) or not. If I(A;B) does not exceed that threshold, we state that D_2(A, B) = 0.

The extension of such a definition to triplets is not immediate, since the ternary mutual information I(A;B;C) defined in information theory does not have the meaning we need now. Therefore we suggest the following.

Definition 17 (Entropy-Based Third-Order Dependence). If both E(A|B) - E(A|B,C) and E(A|C) - E(A|B,C) exceed a given threshold, we state that D_3(A, B, C) > 0 when P(a,b,c) is larger than all the following estimates:

P(a) P(b,c), P(b) P(a,c), P(c) P(a,b),

or D_3(A, B, C) < 0 when P(a,b,c) is less than all three preceding estimates; otherwise, D_3(A, B, C) = 0.

Definition 17 might seem asymmetric with respect to variables A, B, and C. In order to understand the reasons for which the relationships written in Definition 17 are symmetric, remember that, for example, if E(A|B) - E(A|B,C) >= th, then also E(C|B) - E(C|A,B) >= th. In addition, E(A|B,C) <= E(A|B) and E(A|B,C) <= E(A|C). When, for example, E(A|B) = E(A|B,C), E(A|B,C) is maximum and, therefore, E(A,B,C) = E(B,C) + E(A|B,C) is also maximum (maximum independence condition). It follows that E(B|A,C) and E(C|A,B) also take their maximum values, equal to E(B|A) or E(B|C) and to E(C|A) or E(C|B), respectively.

An analogous definition can be introduced to evaluate dependencies in k-plets with arbitrary k.

Definition 18 (Entropy-Based General Dependence). Let E^k_{MIN} be the minimum of all the conditional entropies

E(A|C, D, ..., Z), E(A|B, D, ..., Z), ...,

where {C, D, ..., Z}, {B, D, ..., Z}, ... are the subsets of (k-2) variables of the set {B, C, D, ..., Z} of (k-1) variables.
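Definition 16 can be computed directly for a pair of binary variables from P(a), P(b), and P(a,b). A sketch with hypothetical names, using natural logarithms:

```python
import math

def mutual_info_pair(p_a, p_b, p_ab):
    """I(A;B) as the sum over the four cells of P(x,y) log(P(x,y)/(P(x)P(y))),
    which equals E(A) - E(A|B). Cells with zero probability contribute 0."""
    cells = [
        (p_ab, p_a * p_b),
        (p_a - p_ab, p_a * (1 - p_b)),
        (p_b - p_ab, (1 - p_a) * p_b),
        (1 - p_a - p_b + p_ab, (1 - p_a) * (1 - p_b)),
    ]
    return sum(p * math.log(p / q) for p, q in cells if p > 0)

def entropy_based_d2(p_a, p_b, p_ab, th):
    """Definition 16: if I(A;B) exceeds th, the sign of the dependence is
    decided by comparing P(a,b) with P(a)P(b); otherwise D_2 = 0."""
    if mutual_info_pair(p_a, p_b, p_ab) <= th:
        return 0
    return 1 if p_ab > p_a * p_b else -1
```

Under independence (P(a,b) = P(a)P(b)) the mutual information is zero, so the state is 0 for any positive threshold; pulling P(a,b) above or below the product flips the state to +1 or -1.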

If E^k_{MIN} - E(A|B, C, D, ..., Z) (where {B, C, D, ..., Z} is the complete set of the (k-1) remaining variables) exceeds a given threshold, and P(a, b, c, ..., z) is larger than the maximum of the estimates

P(a) P(b, c, ..., z), P(b) P(a, c, ..., z), P(c) P(a, b, ..., z), ...,

we state that D_k(A, B, C, ..., Z) > 0. If the same difference E^k_{MIN} - E(A|B, C, D, ..., Z) exceeds the given threshold and P(a, b, c, ..., z) is smaller than the minimum of the estimates

P(a) P(b, c, ..., z), P(b) P(a, c, ..., z), P(c) P(a, b, ..., z), ...,

we state that D_k(A, B, C, ..., Z) < 0. If neither of the two above specified conditions holds, we state that D_k(A, B, C, ..., Z) = 0.

Notice that the computation of entropies can be simplified by applying the following theorem.

THEOREM 10. If the entropies of order k-1 are known, a single entropy needs to be determined in order to calculate all the entropies of order k.

PROOF. Assume we know E(I_1 | I_2, I_3, ..., I_k). First we determine E(I_2 | I_1, I_3, ..., I_k) by observing that

I(I_1; I_2 | I_3, ..., I_k) = E(I_1 | I_3, I_4, ..., I_k) - E(I_1 | I_2, I_3, ..., I_k) = E(I_2 | I_3, I_4, ..., I_k) - E(I_2 | I_1, I_3, ..., I_k),

where only the last entropy is unknown. A similar method can be applied to determine all the other conditional entropies of the type

E(I_j | I_1, I_2, ..., I_{j-1}, I_{j+1}, ..., I_k) (with 2 <= j <= k). Finally, any other entropy can be easily calculated in terms of those already calculated. For example,

E(I_1, I_2 | I_3, I_4, ..., I_k) = E(I_2 | I_3, I_4, ..., I_k) + E(I_1 | I_2, I_3, ..., I_k). ∎

5. DETERMINING DEPENDENCE VALUES

The analysis developed in this article essentially refers to the concept of confidence and not to the principle of support. Almost all the algorithms for data mining proposed thus far are based on a first step that determines which k-plets have sufficient support, that is, sufficient statistical relevance. Such solutions are compatible with the following algorithm for determining all the relevant dependencies up to a certain order.

(1) Determining the k-plets that have sufficient support. Most algorithms for determining the k-plets that have sufficient support proceed in order of increasing cardinalities. In other words, they first determine the single items, then the pairs of items, the triplets, and so on. Such algorithms are well suited to the following procedure. The algorithms should be modified in order to examine a k-plet P after the (k-1)-plets contained in P. The program, which was developed specifically to verify the ideas described in this article, is based on an algorithm for determining the k-plets (also called itemsets) that have sufficient support [Meo 1999]. It was chosen for its speed, but it produces the list of itemsets organized in a family of trees. However, this data structure, like any other, can be transformed into a lattice suitable for the computations described above in a relatively short time. Note that it is not necessary that the complete lattice of all the itemsets with sufficient support be represented in the main memory at the same time.
What is really needed is that, at the starting point, every node of the structure, i.e., every analyzed itemset, is represented by two sets of data: (a) the values of the joint probabilities describing that itemset; (b) the pointers to its parents. For simplicity, as concerns point (a) above, the program described here describes an itemset with a single datum (by virtue of Theorem 7 on the unicity of the value). The chosen datum is the number of occurrences n_{ab...z} of that itemset, which is proportional to its probability P(a, b, ..., z) (see Figure 9, where ptr_x denotes the pointer to itemset x).
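The node layout just described (one occurrence count plus pointers to the (k-1)-subsets) can be sketched as follows; the class and helper names are hypothetical, not the paper's actual implementation:

```python
class ItemsetNode:
    """One node of the itemset lattice: the occurrence count of the itemset
    plus pointers to the nodes of its (k-1)-subsets, as in Figure 9."""
    __slots__ = ("items", "count", "parents")

    def __init__(self, items, count, parents=()):
        self.items = frozenset(items)
        self.count = count            # n_{ab...z}, proportional to P(a,b,...,z)
        self.parents = list(parents)  # nodes of the (k-1)-subsets

def build_abc_lattice(counts):
    """Build a small lattice like the one of Figure 9 from a dict such as
    {'a': n_a, ..., 'abc': n_abc} (keys are strings of item names).
    Processing keys in order of length guarantees parents exist first."""
    nodes = {}
    for key in sorted(counts, key=len):
        parents = [nodes[k] for k in nodes
                   if len(k) == len(key) - 1 and set(k) <= set(key)]
        nodes[key] = ItemsetNode(key, counts[key], parents)
    return nodes
```

For the lattice of Figure 9, the node for {a,b,c} ends up with three parent pointers (to {a,b}, {a,c}, {b,c}), each pair node with two, and each single item with none.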

Fig. 9. Data structure of the itemsets: each node of the lattice over {a, b, c} stores the occurrence count of its itemset (n_a, n_b, n_c, n_ab, n_ac, n_bc, n_abc) and the pointers (ptr_a, ptr_b, ..., ptr_bc) to the nodes of its subsets of cardinality one less.

This choice makes it possible to store millions of itemsets in the main memory at the same time and to perform all the following computations without storing any partial results in the mass memory.

(2) Determining all the joint probabilities of an itemset. The computation of the joint probabilities P(i^*_1, i^*_2, ..., i^*_k) for all the combinations of values of i^*_1, i^*_2, ..., i^*_k can be performed recursively by applying the relationships presented in the proof of Theorem 7 on the unicity of the value. Of course, recursion proceeds towards the parents and the grandparents. For example, in the case of Figure 9,

P(a, \bar{b}, \bar{c}) = P(a, \bar{c}) - P(a, b, \bar{c}) = P(a) - P(a,c) - P(a,b) + P(a,b,c),

where only the probabilities directly connected to the numbers of occurrences introduced in Figure 9 appear.

(3) Determining maximum independence estimates. The determination of the value x for which the joint entropy

E(x) = - \sum P(i^*_1, i^*_2, ..., i^*_k) \log P(i^*_1, i^*_2, ..., i^*_k)

takes its maximum value can be performed numerically, at the desired level of accuracy, with conventional interpolation techniques.

(4) Computing the dependence values and states. A direct application of Definitions 13 and 14 leads to the final results.
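Step (3) can be sketched for a triple: with the marginals and pairwise probabilities fixed, all eight cell probabilities are linear in x = P(a,b,c), the joint entropy is concave in x, and a simple ternary search over the feasible interval locates the maximum. A sketch under these assumptions (hypothetical names; the paper's program uses conventional interpolation instead):

```python
import math

def mi_estimate_triple(pa, pb, pc, pab, pac, pbc, tol=1e-12):
    """Maximum independence estimate of P(a,b,c): the value x that maximizes
    the joint entropy of the eight cells. Since the entropy is concave in x,
    a ternary search on the feasible interval converges to the maximum."""
    def cells(x):
        # the eight joint probabilities, by inclusion-exclusion (Theorem 7)
        return [x, pab - x, pac - x, pbc - x,
                pa - pab - pac + x, pb - pab - pbc + x, pc - pac - pbc + x,
                1 - pa - pb - pc + pab + pac + pbc - x]

    def entropy(x):
        return -sum(p * math.log(p) for p in cells(x) if p > 0)

    # feasible interval: every cell probability must stay non-negative
    lo = max(0.0, pab + pac - pa, pab + pbc - pb, pac + pbc - pc)
    hi = min(pab, pac, pbc)
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if entropy(m1) < entropy(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2
```

For independent marginals (P(a) = P(b) = P(c) = 0.5, all pairwise probabilities 0.25) the estimate is 0.125, the full independence product, as expected.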

Table I. Number of Itemsets and Their Average Lengths in the Experiments (columns: total itemset number, nominal average length, actual average length; the numeric entries are not reproduced here).

Experimental Evaluation

The proposed approach has been verified with an implementation in C++ using the Standard Template Library. The program has been run on a PC Pentium II, with a 233 MHz clock, 128 MB RAM, and Red Hat Linux as the operating system. We worked on a class of databases taken as a benchmark for association rules by most data mining algorithms: the class of synthetic databases that the Quest project of IBM generated for its experiments (see Agrawal and Srikant [1994] for details). We experimented many times on several databases with different values for the main parameters and for minimum support, but the results were all similar to the ones presented here. In particular, the experiments were run with the value for minimum support equal to 0.2% and with a fixed precision in the computation of the dependence values. In generating the databases, we adopted the same parameter settings proposed for synthetic databases: D, the number of transactions in the database, fixed to 100,000; N, the total number of items, fixed to 1000; T, the average transaction length, fixed to 10, since its value does not influence program behavior. On the contrary, I, the average length of the frequent itemsets, was varied, since its value really determines the depth of the lattice to be generated. Each database contains itemsets with sufficient support, with different average lengths (3, 4, ..., 8). The extreme values of the interval [2, 10] were discarded for the following reasons. The low value was discarded because it does not make much sense to maximize the entropy related to itemsets having only two items. The direct approach, which compares the probability of such an itemset with the product of the probabilities of the two items, was adopted in this case.
High values were avoided because the longer the itemsets, the higher the probability that they fall under the threshold for minimum support. In this case, too few itemsets turn out to be over the threshold, and comparing the different experiments is not fair. Hence, even if the nominal average itemset length is increased, the actual average length of the itemsets with sufficient support turns out to be significantly lower. Table I reports the total number of itemsets with sufficient support together with their nominal and actual average lengths. In Figure 10, two execution times are shown: T_1 is the CPU execution time needed for identifying all the itemsets with sufficient support; T_2 is

the time for computing the dependence values of the itemsets identified previously. Both times have been normalized with respect to the total number of itemsets, since this number changes considerably in the various experiments.

Fig. 10. Experimental results (CPU time [s] per itemset versus average itemset length, for curves T_1 and T_2).

Notice that time T_1 decreases with the actual average length of the itemsets. This is a peculiarity of the algorithm (called Seq) adopted for the first step, because during its execution it builds temporary data structures that do not depend on the length of the itemset. Furthermore, Seq proved suitable for very large databases and for searches characterized by very low resolution values. Time T_2, on the contrary, increases with the average length of the itemsets, because the depth of the resulting lattice increases. This fact is not surprising. On the other hand, note that the increments are moderate, with the exception of the experiment with the average itemset length equal to 4.82 (corresponding to a nominal average itemset length equal to 6). In that experiment, as Table I shows, the total number of itemsets that exceed the minimum support threshold grows suddenly with respect to the itemset length. The lattice generated in these conditions is very heavily populated. Hence, the high dimension of the output explains the result. Finally, these experiments demonstrate that this new approach to knowledge discovery of itemset dependencies is feasible and suitable for the high-resolution searches typical of data mining.

CONCLUSIONS

We show in this article that a single real number, the dependence value, contains all the information on the dependencies relative to a given itemset. In addition, by virtue of the theorem stating that dependence functions are always parity functions, the determination of the combinations of values for which the dependencies are positive or negative is immediate. In addition, for practical cases, the feasibility of this new theory is demonstrated in a set of experiments on various databases.

Some themes are worth developing further. The first concerns the maximum independence estimates. Is it possible to find a closed formula giving the maximum independence probability of a given itemset as a function of lower-level probabilities? The second theme is defining confidence levels. Which percentage of the probability P(a,b,...) must be exceeded by the dependence value in order to state that the dependence is strong? This question has not been discussed adequately in the support-confidence framework, but the model introduced here is probably more suitable for sound theoretical analysis. A third area for investigation concerns the algorithms for determining the dependence values. The method proposed in this article assumes that the itemsets with sufficient support have been determined by adopting one of the known methods, and then performs the computation of the dependence values on them. However, it is likely that an integrated method combining the two steps would be more rapid and effective. Theoretical analysis based on probability and information theory, as well as the development of new algorithms, should be combined and integrated in this research.

REFERENCES

AGGARWAL, C. C. AND YU, P. S. 1998. Online generation of association rules. In Proceedings of the 14th International Conference on Data Engineering (Orlando, FL). IEEE Computer Society Press, Los Alamitos, CA.
AGRAWAL, R., MANNILA, H., SRIKANT, R., TOIVONEN, H., AND VERKAMO, A. I. 1996. Fast discovery of association rules.
In Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. AAAI Press, Menlo Park, CA.
AGRAWAL, R. AND SRIKANT, R. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94, Santiago, Chile, Sept.). VLDB Endowment, Berkeley, CA.
AGRAWAL, R., IMIELINSKI, T., AND SWAMI, A. 1993. Database mining: A performance perspective. IEEE Trans. Knowl. Data Eng. 5, 6 (Dec.).
AGRAWAL, R., IMIELINSKI, T., AND SWAMI, A. 1993. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD '93, Washington, DC, May 26-28), P. Buneman and S. Jajodia, Eds. ACM Press, New York, NY.
BRIN, S., MOTWANI, R., ULLMAN, J. D., AND TSUR, S. 1997. Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '97, Tucson, AZ, May 13-15), J. M. Peckman, S. Ram, and M. Franklin, Eds. ACM Press, New York, NY.
HAN, J. AND FU, Y. 1995. Discovery of multiple-level association rules from large databases. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95, Zurich, Sept.).

HAN, J., CAI, Y., AND CERCONE, N. 1992. Knowledge discovery in databases: An attribute-oriented approach. In Proceedings of the 18th International Conference on Very Large Data Bases (Vancouver, B.C., Aug.). VLDB Endowment, Berkeley, CA.
HOUTSMA, M. A. W. AND SWAMI, A. 1995. Set-oriented mining for association rules in relational databases. In Proceedings of the IEEE 11th International Conference on Data Engineering (Taipei, Taiwan, Mar. 6-10). IEEE Press, Piscataway, NJ.
IMIELINSKI, T. 1996. From file mining to database mining. In Proceedings of the ACM SIGMOD International Workshop on Data Mining and Knowledge Discovery (SIGMOD '96, Aug.), R. Ng, Ed. ACM Press, New York, NY.
ELDER IV, J. F. AND PREGIBON, D. A statistical perspective on KDD. Tech. Rep.
THOMAS, G., KAWAGOE, K., KRISHNAMURTHY, R., IMIELINSKI, T., REINER, D., AND WOLSKI, A. 1991. Practitioner problems in need of database research. SIGMOD Rec. 20, 3 (Sept.).
LIN, D. AND KEDEM, Z. 1998. Pincer-search: A new algorithm for discovering the maximum frequent set. In Proceedings of the 6th International Conference on Extending Database Technology (EDBT '98, Valencia, Spain, Mar.).
MEO, R. 1999. A new approach for the discovery of frequent itemsets. In Proceedings of the First International Conference on Data Warehousing and Knowledge Discovery (DaWaK '99, Florence, Aug./Sept.).
PARK, J. S., CHEN, M.-S., AND YU, P. S. 1995. An effective hash based algorithm for mining association rules. In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD '95, San Jose, CA, May). ACM Press, New York, NY.
SAVASERE, A., OMIECINSKI, E., AND NAVATHE, S. 1995. An efficient algorithm for mining association rules in large databases. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95, Zurich, Sept.).
SILVERSTEIN, C., BRIN, S., AND MOTWANI, R. 1998. Beyond market baskets: Generalizing association rules to dependence rules. Data Mining Knowl. Discovery 2, 1.
SRIKANT, R. AND AGRAWAL, R. Mining quantitative association rules in large relational tables.
In Proceedings of the ACM SIGMOD International Conference on Management of Data (San Jose, CA, May). ACM Press, New York, NY.
SRIKANT, R. AND AGRAWAL, R. 1995. Mining generalized association rules. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95, Zurich, Sept.).
TOIVONEN, H. 1996. Sampling large databases for association rules. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB '96, Bombay, Sept.).

Received: October 1998; revised: September 1999; accepted: January 2000


More information

Data Mining and Analysis: Fundamental Concepts and Algorithms

Data Mining and Analysis: Fundamental Concepts and Algorithms Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA

More information

DATA MINING - 1DL360

DATA MINING - 1DL360 DATA MINING - 1DL36 Fall 212" An introductory class in data mining http://www.it.uu.se/edu/course/homepage/infoutv/ht12 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology, Uppsala

More information

Frequent Itemset Mining

Frequent Itemset Mining ì 1 Frequent Itemset Mining Nadjib LAZAAR LIRMM- UM COCONUT Team (PART I) IMAGINA 17/18 Webpage: http://www.lirmm.fr/~lazaar/teaching.html Email: lazaar@lirmm.fr 2 Data Mining ì Data Mining (DM) or Knowledge

More information

Mining Temporal Patterns for Interval-Based and Point-Based Events

Mining Temporal Patterns for Interval-Based and Point-Based Events International Journal of Computational Engineering Research Vol, 03 Issue, 4 Mining Temporal Patterns for Interval-Based and Point-Based Events 1, S.Kalaivani, 2, M.Gomathi, 3, R.Sethukkarasi 1,2,3, Department

More information

Association Rule Mining on Web

Association Rule Mining on Web Association Rule Mining on Web What Is Association Rule Mining? Association rule mining: Finding interesting relationships among items (or objects, events) in a given data set. Example: Basket data analysis

More information

Maintaining Frequent Itemsets over High-Speed Data Streams

Maintaining Frequent Itemsets over High-Speed Data Streams Maintaining Frequent Itemsets over High-Speed Data Streams James Cheng, Yiping Ke, and Wilfred Ng Department of Computer Science Hong Kong University of Science and Technology Clear Water Bay, Kowloon,

More information

Alternative Approach to Mining Association Rules

Alternative Approach to Mining Association Rules Alternative Approach to Mining Association Rules Jan Rauch 1, Milan Šimůnek 1 2 1 Faculty of Informatics and Statistics, University of Economics Prague, Czech Republic 2 Institute of Computer Sciences,

More information

DATA MINING - 1DL360

DATA MINING - 1DL360 DATA MINING - DL360 Fall 200 An introductory class in data mining http://www.it.uu.se/edu/course/homepage/infoutv/ht0 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology, Uppsala

More information

732A61/TDDD41 Data Mining - Clustering and Association Analysis

732A61/TDDD41 Data Mining - Clustering and Association Analysis 732A61/TDDD41 Data Mining - Clustering and Association Analysis Lecture 6: Association Analysis I Jose M. Peña IDA, Linköping University, Sweden 1/14 Outline Content Association Rules Frequent Itemsets

More information

NetBox: A Probabilistic Method for Analyzing Market Basket Data

NetBox: A Probabilistic Method for Analyzing Market Basket Data NetBox: A Probabilistic Method for Analyzing Market Basket Data José Miguel Hernández-Lobato joint work with Zoubin Gharhamani Department of Engineering, Cambridge University October 22, 2012 J. M. Hernández-Lobato

More information

Discovery of Functional and Approximate Functional Dependencies in Relational Databases

Discovery of Functional and Approximate Functional Dependencies in Relational Databases JOURNAL OF APPLIED MATHEMATICS AND DECISION SCIENCES, 7(1), 49 59 Copyright c 2003, Lawrence Erlbaum Associates, Inc. Discovery of Functional and Approximate Functional Dependencies in Relational Databases

More information

Machine Learning: Pattern Mining

Machine Learning: Pattern Mining Machine Learning: Pattern Mining Information Systems and Machine Learning Lab (ISMLL) University of Hildesheim Wintersemester 2007 / 2008 Pattern Mining Overview Itemsets Task Naive Algorithm Apriori Algorithm

More information

Guaranteeing the Accuracy of Association Rules by Statistical Significance

Guaranteeing the Accuracy of Association Rules by Statistical Significance Guaranteeing the Accuracy of Association Rules by Statistical Significance W. Hämäläinen Department of Computer Science, University of Helsinki, Finland Abstract. Association rules are a popular knowledge

More information

ASSOCIATION ANALYSIS FREQUENT ITEMSETS MINING. Alexandre Termier, LIG

ASSOCIATION ANALYSIS FREQUENT ITEMSETS MINING. Alexandre Termier, LIG ASSOCIATION ANALYSIS FREQUENT ITEMSETS MINING, LIG M2 SIF DMV course 207/208 Market basket analysis Analyse supermarket s transaction data Transaction = «market basket» of a customer Find which items are

More information

An Approach to Classification Based on Fuzzy Association Rules

An Approach to Classification Based on Fuzzy Association Rules An Approach to Classification Based on Fuzzy Association Rules Zuoliang Chen, Guoqing Chen School of Economics and Management, Tsinghua University, Beijing 100084, P. R. China Abstract Classification based

More information

FP-growth and PrefixSpan

FP-growth and PrefixSpan FP-growth and PrefixSpan n Challenges of Frequent Pattern Mining n Improving Apriori n Fp-growth n Fp-tree n Mining frequent patterns with FP-tree n PrefixSpan Challenges of Frequent Pattern Mining n Challenges

More information

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #11: Frequent Itemsets

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #11: Frequent Itemsets CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #11: Frequent Itemsets Refer Chapter 6. MMDS book. VT CS 5614 2 Associa=on Rule Discovery Supermarket shelf management Market-basket model:

More information

A Clear View on Quality Measures for Fuzzy Association Rules

A Clear View on Quality Measures for Fuzzy Association Rules A Clear View on Quality Measures for Fuzzy Association Rules Martine De Cock, Chris Cornelis, and Etienne E. Kerre Fuzziness and Uncertainty Modelling Research Unit Department of Applied Mathematics and

More information

CS 412 Intro. to Data Mining

CS 412 Intro. to Data Mining CS 412 Intro. to Data Mining Chapter 6. Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods Jiawei Han, Computer Science, Univ. Illinois at Urbana -Champaign, 2017 1 2 3

More information

Constraint-Based Rule Mining in Large, Dense Databases

Constraint-Based Rule Mining in Large, Dense Databases Appears in Proc of the 5th Int l Conf on Data Engineering, 88-97, 999 Constraint-Based Rule Mining in Large, Dense Databases Roberto J Bayardo Jr IBM Almaden Research Center bayardo@alummitedu Rakesh Agrawal

More information

Association Rules Information Retrieval and Data Mining. Prof. Matteo Matteucci

Association Rules Information Retrieval and Data Mining. Prof. Matteo Matteucci Association Rules Information Retrieval and Data Mining Prof. Matteo Matteucci Learning Unsupervised Rules!?! 2 Market-Basket Transactions 3 Bread Peanuts Milk Fruit Jam Bread Jam Soda Chips Milk Fruit

More information

CS 484 Data Mining. Association Rule Mining 2

CS 484 Data Mining. Association Rule Mining 2 CS 484 Data Mining Association Rule Mining 2 Review: Reducing Number of Candidates Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent Apriori principle holds due

More information

A Logical Formulation of the Granular Data Model

A Logical Formulation of the Granular Data Model 2008 IEEE International Conference on Data Mining Workshops A Logical Formulation of the Granular Data Model Tuan-Fang Fan Department of Computer Science and Information Engineering National Penghu University

More information

Finding Association Rules that Trade Support Optimally Against Confidence

Finding Association Rules that Trade Support Optimally Against Confidence Finding Association Rules that Trade Support Optimally Against Confidence Tobias Scheffer Humboldt-Universität zu Berlin, Department of Computer Science Unter den Linden 6, 199 Berlin, Germany scheffer@informatik.hu-berlin.de

More information

Density-Based Clustering

Density-Based Clustering Density-Based Clustering idea: Clusters are dense regions in feature space F. density: objects volume ε here: volume: ε-neighborhood for object o w.r.t. distance measure dist(x,y) dense region: ε-neighborhood

More information

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05 Meelis Kull meelis.kull@ut.ee Autumn 2017 1 Sample vs population Example task with red and black cards Statistical terminology Permutation test and hypergeometric test Histogram on a sample vs population

More information

Discovery of Frequent Word Sequences in Text. Helena Ahonen-Myka. University of Helsinki

Discovery of Frequent Word Sequences in Text. Helena Ahonen-Myka. University of Helsinki Discovery of Frequent Word Sequences in Text Helena Ahonen-Myka University of Helsinki Department of Computer Science P.O. Box 26 (Teollisuuskatu 23) FIN{00014 University of Helsinki, Finland, helena.ahonen-myka@cs.helsinki.fi

More information

2002 Journal of Software

2002 Journal of Software 1-9825/22/13(3)41-7 22 Journal of Software Vol13, No3,,,, (, 2326) E-mail inli@ustceducn http//wwwustceducn,,,,, Agrawal,,, ; ; ; TP18 A,,,,, ( ),,, ; Agrawal [1], [2],, 1 Agrawal [1], [1],Agrawal,, Heikki

More information

Data mining, 4 cu Lecture 5:

Data mining, 4 cu Lecture 5: 582364 Data mining, 4 cu Lecture 5: Evaluation of Association Patterns Spring 2010 Lecturer: Juho Rousu Teaching assistant: Taru Itäpelto Evaluation of Association Patterns Association rule algorithms

More information

EFFICIENT MINING OF WEIGHTED QUANTITATIVE ASSOCIATION RULES AND CHARACTERIZATION OF FREQUENT ITEMSETS

EFFICIENT MINING OF WEIGHTED QUANTITATIVE ASSOCIATION RULES AND CHARACTERIZATION OF FREQUENT ITEMSETS EFFICIENT MINING OF WEIGHTED QUANTITATIVE ASSOCIATION RULES AND CHARACTERIZATION OF FREQUENT ITEMSETS Arumugam G Senior Professor and Head, Department of Computer Science Madurai Kamaraj University Madurai,

More information

NEGATED ITEMSETS OBTAINING METHODS FROM TREE- STRUCTURED STREAM DATA

NEGATED ITEMSETS OBTAINING METHODS FROM TREE- STRUCTURED STREAM DATA NEGATED ITEMSETS OBTAINING METHODS FROM TREE- STRUCTURED STREAM DATA JURYON PAIK Pyeongtaek University, Department of Digital Information and Statistics, Gyeonggi-do 17869, South Korea E-mail: jrpaik@ptu.ac.kr

More information

InfoMiner: Mining Surprising Periodic Patterns

InfoMiner: Mining Surprising Periodic Patterns InfoMiner: Mining Surprising Periodic Patterns Jiong Yang IBM Watson Research Center jiyang@us.ibm.com Wei Wang IBM Watson Research Center ww1@us.ibm.com Philip S. Yu IBM Watson Research Center psyu@us.ibm.com

More information

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany Syllabus Fri. 21.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 27.10. (2) A.1 Linear Regression Fri. 3.11. (3) A.2 Linear Classification Fri. 10.11. (4) A.3 Regularization

More information

Mining Rank Data. Sascha Henzgen and Eyke Hüllermeier. Department of Computer Science University of Paderborn, Germany

Mining Rank Data. Sascha Henzgen and Eyke Hüllermeier. Department of Computer Science University of Paderborn, Germany Mining Rank Data Sascha Henzgen and Eyke Hüllermeier Department of Computer Science University of Paderborn, Germany {sascha.henzgen,eyke}@upb.de Abstract. This paper addresses the problem of mining rank

More information

Mining Free Itemsets under Constraints

Mining Free Itemsets under Constraints Mining Free Itemsets under Constraints Jean-François Boulicaut Baptiste Jeudy Institut National des Sciences Appliquées de Lyon Laboratoire d Ingénierie des Systèmes d Information Bâtiment 501 F-69621

More information

Apriori algorithm. Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK. Presentation Lauri Lahti

Apriori algorithm. Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK. Presentation Lauri Lahti Apriori algorithm Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK Presentation 12.3.2008 Lauri Lahti Association rules Techniques for data mining and knowledge discovery in databases

More information

Association Analysis: Basic Concepts. and Algorithms. Lecture Notes for Chapter 6. Introduction to Data Mining

Association Analysis: Basic Concepts. and Algorithms. Lecture Notes for Chapter 6. Introduction to Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Association

More information

Lecture Notes for Chapter 6. Introduction to Data Mining. (modified by Predrag Radivojac, 2017)

Lecture Notes for Chapter 6. Introduction to Data Mining. (modified by Predrag Radivojac, 2017) Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar (modified by Predrag Radivojac, 27) Association Rule Mining Given a set of transactions, find rules that will predict the

More information

A Note on the Recursive Calculation of Incomplete Gamma Functions

A Note on the Recursive Calculation of Incomplete Gamma Functions A Note on the Recursive Calculation of Incomplete Gamma Functions WALTER GAUTSCHI Purdue University It is known that the recurrence relation for incomplete gamma functions a n, x, 0 a 1, n 0,1,2,..., when

More information

Levelwise Search and Borders of Theories in Knowledge Discovery

Levelwise Search and Borders of Theories in Knowledge Discovery Data Mining and Knowledge Discovery 1, 241 258 (1997) c 1997 Kluwer Academic Publishers. Manufactured in The Netherlands. Levelwise Search and Borders of Theories in Knowledge Discovery HEIKKI MANNILA

More information

Algorithms for Characterization and Trend Detection in Spatial Databases

Algorithms for Characterization and Trend Detection in Spatial Databases Published in Proceedings of 4th International Conference on Knowledge Discovery and Data Mining (KDD-98) Algorithms for Characterization and Trend Detection in Spatial Databases Martin Ester, Alexander

More information

A New Concise and Lossless Representation of Frequent Itemsets Using Generators and A Positive Border

A New Concise and Lossless Representation of Frequent Itemsets Using Generators and A Positive Border A New Concise and Lossless Representation of Frequent Itemsets Using Generators and A Positive Border Guimei Liu a,b Jinyan Li a Limsoon Wong b a Institute for Infocomm Research, Singapore b School of

More information

Association Analysis. Part 1

Association Analysis. Part 1 Association Analysis Part 1 1 Market-basket analysis DATA: A large set of items: e.g., products sold in a supermarket A large set of baskets: e.g., each basket represents what a customer bought in one

More information

Processing Count Queries over Event Streams at Multiple Time Granularities

Processing Count Queries over Event Streams at Multiple Time Granularities Processing Count Queries over Event Streams at Multiple Time Granularities Aykut Ünal, Yücel Saygın, Özgür Ulusoy Department of Computer Engineering, Bilkent University, Ankara, Turkey. Faculty of Engineering

More information

Regression and Correlation Analysis of Different Interestingness Measures for Mining Association Rules

Regression and Correlation Analysis of Different Interestingness Measures for Mining Association Rules International Journal of Innovative Research in Computer Scien & Technology (IJIRCST) Regression and Correlation Analysis of Different Interestingness Measures for Mining Association Rules Mir Md Jahangir

More information

Selecting the Right Interestingness Measure for Association Patterns

Selecting the Right Interestingness Measure for Association Patterns Selecting the Right ingness Measure for Association Patterns Pang-Ning Tan Department of Computer Science and Engineering University of Minnesota 2 Union Street SE Minneapolis, MN 55455 ptan@csumnedu Vipin

More information

Relation between Pareto-Optimal Fuzzy Rules and Pareto-Optimal Fuzzy Rule Sets

Relation between Pareto-Optimal Fuzzy Rules and Pareto-Optimal Fuzzy Rule Sets Relation between Pareto-Optimal Fuzzy Rules and Pareto-Optimal Fuzzy Rule Sets Hisao Ishibuchi, Isao Kuwajima, and Yusuke Nojima Department of Computer Science and Intelligent Systems, Osaka Prefecture

More information

Inferring minimal rule covers from relations

Inferring minimal rule covers from relations Inferring minimal rule covers from relations CLAUDIO CARPINETO and GIOVANNI ROMANO Fondazione Ugo Bordoni, Via B. Castiglione 59, 00142 Rome, Italy Tel: +39-6-54803426 Fax: +39-6-54804405 E-mail: carpinet@fub.it

More information

Standardizing Interestingness Measures for Association Rules

Standardizing Interestingness Measures for Association Rules Standardizing Interestingness Measures for Association Rules arxiv:138.374v1 [stat.ap] 16 Aug 13 Mateen Shaikh, Paul D. McNicholas, M. Luiza Antonie and T. Brendan Murphy Department of Mathematics & Statistics,

More information

Chapter 4: Frequent Itemsets and Association Rules

Chapter 4: Frequent Itemsets and Association Rules Chapter 4: Frequent Itemsets and Association Rules Jilles Vreeken Revision 1, November 9 th Notation clarified, Chi-square: clarified Revision 2, November 10 th details added of derivability example Revision

More information

Application of Apriori Algorithm in Open Experiment

Application of Apriori Algorithm in Open Experiment 2011 International Conference on Computer Science and Information Technology (ICCSIT 2011) IPCSIT vol. 51 (2012) (2012) IACSIT Press, Singapore DOI: 10.7763/IPCSIT.2012.V51.130 Application of Apriori Algorithm

More information

Approximating a Collection of Frequent Sets

Approximating a Collection of Frequent Sets Approximating a Collection of Frequent Sets Foto Afrati National Technical University of Athens Greece afrati@softlab.ece.ntua.gr Aristides Gionis HIIT Basic Research Unit Dept. of Computer Science University

More information

Mining Strong Positive and Negative Sequential Patterns

Mining Strong Positive and Negative Sequential Patterns Mining Strong Positive and Negative Sequential Patter NANCY P. LIN, HUNG-JEN CHEN, WEI-HUA HAO, HAO-EN CHUEH, CHUNG-I CHANG Department of Computer Science and Information Engineering Tamang University,

More information

12 Count-Min Sketch and Apriori Algorithm (and Bloom Filters)

12 Count-Min Sketch and Apriori Algorithm (and Bloom Filters) 12 Count-Min Sketch and Apriori Algorithm (and Bloom Filters) Many streaming algorithms use random hashing functions to compress data. They basically randomly map some data items on top of each other.

More information

Chapter 1 (Basic Probability)

Chapter 1 (Basic Probability) Chapter 1 (Basic Probability) What is probability? Consider the following experiments: 1. Count the number of arrival requests to a web server in a day. 2. Determine the execution time of a program. 3.

More information

Similarity of Attributes by External Probes. P.O. Box 26, FIN Helsinki, Finland. with larger domains.)

Similarity of Attributes by External Probes. P.O. Box 26, FIN Helsinki, Finland. with larger domains.) Similarity of Attributes by External Probes Gautam Das University of Memphis Department of Mathematical Sciences Memphis TN 8, USA dasg@msci.memphis.edu Heikki Mannila and Pirjo Ronkainen University of

More information

Mining Infrequent Patter ns

Mining Infrequent Patter ns Mining Infrequent Patter ns JOHAN BJARNLE (JOHBJ551) PETER ZHU (PETZH912) LINKÖPING UNIVERSITY, 2009 TNM033 DATA MINING Contents 1 Introduction... 2 2 Techniques... 3 2.1 Negative Patterns... 3 2.2 Negative

More information