Theory of Dependence Values


ROSA MEO, Università degli Studi di Torino

A new model to evaluate dependencies in data mining problems is presented and discussed. The well-known concept of the association rule is replaced by the new definition of dependence value, which is a single real number uniquely associated with a given itemset. Knowledge of dependence values is sufficient to describe all the dependencies characterizing a given data mining problem. The dependence value of an itemset is the difference between the occurrence probability of the itemset and a corresponding maximum independence estimate. The latter can be determined as a function of the joint probabilities of the subsets of the itemset being considered, by maximizing a suitable entropy function. It is thus possible to separate, in an itemset of cardinality k, the dependence inherited from its subsets of cardinality k−1 and the specific inherent dependence of that itemset. The absolute value of the difference between the probability P(i) of the event i that indicates the presence of the itemset {a, b, ...} and its maximum independence estimate is constant for any combination of values of a, b, .... In addition, the Boolean function specifying the combinations of values for which the dependence is positive is a parity function, so the determination of such combinations is immediate. The model appears to be simple and powerful.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database applications: Data mining; Statistical databases; H.1.1 [Models and Principles]: Systems and Information Theory: Information theory; I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods

General Terms: Algorithms, Experimentation, Theory

Additional Key Words and Phrases: Association rules, dependence rules, entropy, variable independence
The author was previously (until November 1999) at Politecnico di Torino. Author's address: Department of Computer Science, Università degli Studi di Torino, corso Svizzera 185, Torino, 10149, Italy. ACM Transactions on Database Systems, Vol. 25, No. 3, September 2000.

1. INTRODUCTION

A well-known problem in data mining is the search for association rules, a powerful and intuitive conceptual tool to represent the phenomena that are recurrent in a data set. A number of interesting solutions to that problem have been proposed in the last five years, together with as many powerful

algorithms [Agrawal et al. 1993b; 1995; Agrawal and Srikant 1994; Savasere et al. 1995; Han and Fu 1995; Park et al. 1995; Toivonen 1996; Brin et al. 1997; Lin and Kedem 1998]. They are used in many application fields, such as the analysis of supermarket basket data, failures in telecommunication networks, medical test results, lexical features of texts, etc.

An association rule is an expression of the form X → Y, where X and Y are sets of items that are often found together in a given collection of data. For example, the expression {milk, coffee} → {bread, sugar} might mean that a customer purchasing milk and coffee is likely to also purchase bread and sugar. The validity of an association rule is based on two measures. The first measure, called support, is the percentage of transactions of the database containing both X and Y. The second one, called confidence, is the probability that, if X is purchased, Y is also purchased. In the case of the previous example, a support value of 2% and a confidence value of 15% would mean that 2% of all the customers buy milk, coffee, bread, and sugar, and that 15% of the customers that buy milk and coffee also buy bread and sugar.

Recently, Silverstein et al. [1998] have presented a critique of the concept of association rule and the related support-confidence framework. They have observed that the association rule model is well suited to the market basket problem, but that it does not address other data mining problems. In place of association rules and the support-confidence framework, Silverstein et al. propose a statistical approach based on the chi-squared measure and a new model of rules, called dependence rules. This work can be viewed as a continuation of that line of research, even if the model and the tools proposed here are rather different; in particular, the concept of dependence rules is replaced here by the concept of dependence values.
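The support and confidence measures just defined can be sketched in a few lines of Python. The baskets and item names below are hypothetical, chosen only to mirror the milk/coffee example; this is an illustration of the two measures, not any particular mining algorithm.

```python
# Hypothetical transactions illustrating support and confidence;
# the items and baskets are invented for the example.
baskets = [
    {"milk", "coffee", "bread", "sugar"},
    {"milk", "coffee"},
    {"tea"},
    {"bread", "sugar"},
    {"milk", "coffee", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets containing every item of `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(x, y, baskets):
    """P(Y in basket | X in basket) = support(X | Y) / support(X)."""
    return support(x | y, baskets) / support(x, baskets)

print(support({"milk", "coffee"}, baskets))                # 0.6
print(confidence({"milk", "coffee"}, {"bread"}, baskets))  # 0.666...
```

With these five baskets, {milk, coffee} appears in three of five transactions (support 60%), and two of those three also contain bread (confidence 2/3).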
This article is organized as follows. Section 2 contains a summary of the main results of earlier work, the emphasis being placed on the support-confidence framework, the critique of this model by Silverstein et al., and the concept of dependence rules in opposition to that of association rules. Section 3 contains the definition of dependence value and the other basic definitions of the model proposed here, as well as the theorems following from these definitions. These theorems suggest an easy and quick way to determine the dependence values, which is described in Section 5, whereas Section 4 discusses the use of the well-known concept of entropy as a tool to evaluate the relevance of a dependence rule. Finally, Section 6 draws the conclusions.

2. ASSOCIATION RULES AND DEPENDENCE RULES

As mentioned, this section contains a summary of earlier work on association rules. For ease of reference, the notation used by Silverstein et al. is adopted here.

2.1 Association Rules

Let I = {i_1, i_2, ..., i_k} be a set of k elements, called items. A basket of items is any subset of I. For example, in the market basket application, I = {milk, coffee, bread, sugar, tea, ...} contains all the items stocked by a supermarket, and a basket of items such as {milk, coffee, bread, sugar} is the set of purchases from one register transaction. As a second example, in the document basket application, I is the set of all the dictionary words and each basket is the set of all the words used in a given document.

An association rule X → Y, where X and Y are disjoint subsets of I, was defined by Agrawal et al. [1993b] as follows: X → Y holds if, and only if, X ∪ Y is a subset of at least s% (the support) of all the baskets, and, of all the baskets containing all the items of X, at least c% (the confidence) contain all the items of Y.

The concept of association rules and the related support-confidence framework are very powerful and useful, but they suffer from some limitations, especially when the absence of items is considered. An interesting example proposed by Silverstein et al. is the following. Consider the purchase of tea (t) and coffee (c) in a grocery store and assume the probabilities:

P(c, t) = 0.2
P(c, ¬t) = 0.7
P(¬c, t) = 0.05
P(¬c, ¬t) = 0.05

where ¬c and ¬t denote the events "coffee not purchased" and "tea not purchased," respectively. According to the preceding definitions, the potential rule tea → coffee has a support equal to 20% and a confidence equal to 80%, and therefore can be considered as a valid association rule. However, a deeper analysis shows that a customer buying tea is less likely to also buy coffee than a customer not buying tea (80% against more than 90%). We would write ¬tea → coffee; but, on the contrary, the strongest positive dependence is between the absence of coffee and the presence of tea.

2.2 Dependence Rules

Silverstein et al. propose a view of basket data in terms of Boolean indicator variables, as follows.
Let I_1, I_2, ..., I_k be a set of k Boolean variables called attributes. A set of baskets {b_1, b_2, ..., b_n} is a collection of n k-tuples from {TRUE, FALSE}^k which represent a collection of value assignments to the k

attributes. Assigning the value TRUE to an attribute variable I_j in a basket represents the presence of item i_j in the basket. The event a denotes A = TRUE or, equivalently, the presence of the corresponding item a in a basket. The complementary event ¬a denotes A = FALSE or, equivalently, the absence of item a from a basket. The probability that item a appears in a random basket is denoted P(a) = P(A = TRUE). Likewise, P(a, ¬b) = P(A = TRUE, B = FALSE) is the probability that item a is present and item b is absent.

Silverstein et al. have proposed the following definitions of independence and dependence of events and variables.

Definition 1. Two events x and y are independent if P(x ∧ y) = P(x) · P(y).

Definition 2. Two variables A and B are independent if P(A = v_a, B = v_b) = P(A = v_a) · P(B = v_b) for all possible values v_a, v_b ∈ {TRUE, FALSE}.

Definition 3. Events, or variables, that are not independent are dependent.

Definition 4. Let I be a set of attribute variables. We say that the set I is a dependence rule if I is dependent.

The following Theorem 1 is based on the preceding Definitions 1 through 4.

THEOREM 1. If a set of variables I is dependent, so is every superset of I.

Theorem 1 is important in the dependence rule model, because it makes it possible to restrict attention to the set of minimally dependent itemsets, where an itemset I is minimally dependent if it is dependent, but none of its subsets is dependent. Silverstein et al. have proposed using the chi-squared test for independence to identify dependence rules. The chi-squared statistic is upward-closed with respect to the lattice of all possible itemsets, as dependence rules are. In other words, if a set I of items is deemed dependent at a given significance level, then all supersets of I are also dependent at the same significance level and, therefore, they do not need to be examined for dependence or independence.
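As a sketch of how the chi-squared test detects such a dependence, the statistic for a 2x2 contingency table can be computed directly. The counts below assume a hypothetical total of n = 1000 baskets with the probabilities of the tea/coffee example of Section 2.1; the function is a plain Pearson chi-squared computation, not the authors' own code.

```python
def chi_squared_2x2(n11, n10, n01, n00):
    """Pearson chi-squared statistic for a 2x2 contingency table.
    n11 = count(row present, col present), n10 = count(row present, col absent), etc."""
    n = n11 + n10 + n01 + n00
    row1, row0 = n11 + n10, n01 + n00
    col1, col0 = n11 + n01, n10 + n00
    chi2 = 0.0
    for observed, r, c in ((n11, row1, col1), (n10, row1, col0),
                           (n01, row0, col1), (n00, row0, col0)):
        expected = r * c / n  # expected count under independence
        chi2 += (observed - expected) ** 2 / expected
    return chi2

# Rows: coffee present/absent; columns: tea present/absent (n = 1000).
stat = chi_squared_2x2(200, 700, 50, 50)
print(stat)  # ~37.04, far above the 3.84 critical value at significance 0.05
```

The statistic greatly exceeds the 1-degree-of-freedom critical value, so coffee and tea would be deemed dependent, consistent with the discussion above.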
3. DEPENDENCE VALUES

In this section the new model based on the concept of dependence values is presented and discussed. A theorem proved in this section provides the basic tools to evaluate the dependence rules of a given itemset. To simplify the presentation, we proceed from the simplest cases towards the most complex ones, in order of increasing cardinality of itemsets. In other words, we discuss dependence rules first for pairs of items, then for triplets of items, and finally for m-plets of arbitrary cardinality m.

3.1 Dependence Rules for Pairs of Items

Assume we know the occurrence probabilities of all the items: P(I_1 = TRUE), P(I_2 = TRUE), ..., P(I_k = TRUE). The evaluation of such probabilities is the first problem of data mining, but it is seldom considered because of its simplicity. Generally, the maximum likelihood estimate is adopted, according to which P(a) is assumed equal to O(a)/n, where O(a) is the number of baskets containing a and n is the total number of baskets. However, more complex computations based on Bayes's Theorem might also be used.

In the absence of specific determinations, if we know only P(A = TRUE) and P(B = TRUE), we might formulate the conjectures:

P(a, b) = P(a) · P(b)
P(a, ¬b) = P(a) · P(¬b)
P(¬a, b) = P(¬a) · P(b)
P(¬a, ¬b) = P(¬a) · P(¬b).

These conjectures are equivalent to the assumption that variables A and B are independent. Assume instead that the exact determination of P(a,b), evaluated as O(a,b)/n (where O(a,b) is the number of baskets containing both a and b), is different from the conjecture P(a,b) = P(a) · P(b):

P(a, b) ≠ P(a) · P(b).

It is easy to prove the following theorem.

THEOREM 2 (UNICITY OF THE VALUE FOR SECOND-ORDER PROBABILITIES). If P(A = TRUE) and P(B = TRUE) are known, the determination of the single value

Δ = P(a, b) − P(a) · P(b) = O(a, b)/n − (O(a)/n) · (O(b)/n)

is sufficient to evaluate all the second-order joint probabilities P(a, ¬b), P(¬a, b), P(¬a, ¬b).

PROOF. The proof is contained in the following simple relationships:

P(a, ¬b) = P(a) − P(a, b) = P(a) − P(a) · P(b) − Δ =

P(a) · (1 − P(b)) − Δ = P(a) · P(¬b) − Δ.

Analogously,

P(¬a, b) = P(b) − P(a, b) = P(b) − P(a) · P(b) − Δ = P(b) · (1 − P(a)) − Δ = P(¬a) · P(b) − Δ,

and

P(¬a, ¬b) = P(¬a) − P(¬a, b) = 1 − P(a) − P(¬a) · P(b) + Δ = P(¬a) · (1 − P(b)) + Δ = P(¬a) · P(¬b) + Δ. □

The fact that a single datum contains all the information pertaining to the joint probabilities of the pair {A, B} suggests the following definitions.

Definition 5 (Dependence Value of a Pair). The dependence value of the pair {A, B} is defined as the difference Δ = P(a, b) − P(a) · P(b).

Definition 6 (Dependence State of a Pair). If the absolute value of Δ = P(a, b) − P(a) · P(b) exceeds a given threshold th, A and B are said to be dependent. If Δ > th, the dependence is defined as positive; otherwise, it is defined as negative.

The following notations

D_2(A, B) > 0
D_2(A, B) < 0
D_2(A, B) = 0

are adopted to indicate a positive, a negative, or no dependence, respectively.

Fig. 1. The joint probabilities P(a*, b*) in the cells of the Karnaugh map of {A, B}.

Figure 1 shows that the difference between the joint probability P(a*, b*) (with a* = a or a* = ¬a, and b* = b or b* = ¬b) and the corresponding a-priori estimate P(a*) · P(b*) always has the same absolute value but a different sign in the various cells of the Karnaugh map of variables A and B. To represent this fact, we need another definition and a new theorem.

Definition 7 (Dependence Function of Two Variables). The Boolean function of variables A and B, whose minterms correspond to the values v_A, v_B for which

P(A = v_A, B = v_B) > P(A = v_A) · P(B = v_B)

is called the dependence function of variables A and B.

THEOREM 3 (PARITY OF TWO-VARIABLE DEPENDENCE FUNCTIONS). If D_2(A, B) < 0, the dependence function of variables A and B is

A ⊕ B = A · ¬B + ¬A · B,

which is the parity function with parity odd (Figure 2). If D_2(A, B) > 0, the dependence function of variables A and B is

¬(A ⊕ B) = A · B + ¬A · ¬B,

which is the parity function with parity even (Figure 3).

As a simple example, consider the case of the purchases of coffee (c) and tea (t), which was discussed in Silverstein et al. [1998] to show the weakness of the traditional support-confidence framework (Section 2.1). If

P(c, t) = 0.2
P(c, ¬t) = 0.7
P(¬c, t) = 0.05
P(¬c, ¬t) = 0.05

then

P(C = TRUE) = 0.9
P(T = TRUE) = 0.25.

Fig. 2. The dependence function of variables A and B if D_2(A, B) < 0.
Fig. 3. The dependence function of variables A and B if D_2(A, B) > 0.

Therefore P(c) · P(t) = 0.225 and

P(c, t) < P(c) · P(t),

which shows that the dependence is negative (D_2(C, T) < 0). One might wonder whether the usual notation X → Y adopted in several well-known papers on data mining still makes sense, and how to indicate a negative dependence such as D_2(C, T) < 0 (¬C → T, or C → ¬T, or ¬T → C, or T → ¬C?). The answer is simple: P(c, t) < P(c) · P(t) or, simply, D_2(C, T) < 0 contains all the information on the second-order dependencies. However, one might argue that Δ is more significant for the events having a lower probability. In the case of coffee and tea, P(¬c, t) = 0.05 is the lowest probability in the cells of the dependence function; therefore, it is not completely unreasonable to write ¬C → T.

3.2 Dependence Rules for Triplets of Items

This section is devoted to the generalization of the definitions and theorems presented in Section 3.1 to the case of triplets of items. As we show, such generalization implies some new problems. Consider the case of a triplet of Boolean variables A, B, and C, and assume we know the first- and second-order joint probabilities, from which conditional probabilities such as P(a | b) = P(a, b)/P(b) follow. We are interested in determining the third-order joint probabilities of triplets, such as P(a, b, c), P(a, b, ¬c), and so on, from which the third-order conditional probabilities such as P(a | b, c) = P(A = TRUE | B = TRUE, C = TRUE) follow directly. The following theorem shows that knowledge of a single third-order probability is sufficient to determine all the third-order probabilities.

THEOREM 4 (UNICITY OF THE VALUE FOR THIRD-ORDER PROBABILITIES). All the third-order joint probabilities can be calculated as functions of the first- and second-order joint probabilities and a single datum, such as one third-order joint probability.

PROOF. Assume, for example, we know P(a, b, c). The other joint probabilities can be determined as follows.

P(a, b, ¬c) = P(a, b) − P(a, b, c)
P(a, ¬b, c) = P(a, c) − P(a, b, c)
P(¬a, b, c) = P(b, c) − P(a, b, c)
P(a, ¬b, ¬c) = P(a, ¬b) − P(a, ¬b, c)
P(¬a, b, ¬c) = P(¬a, b) − P(¬a, b, c)
P(¬a, ¬b, c) = P(¬a, c) − P(¬a, b, c)
P(¬a, ¬b, ¬c) = P(¬a, ¬b) − P(¬a, ¬b, c). □

Theorem 4 may be viewed as an extension of Theorem 2 on the unicity of the value for second-order probabilities shown in Section 3.1. However, Theorem 2 makes reference to the differences between the determined P(a*, b*) and the estimated P(a*) · P(b*), which correspond to the conjecture of independence of a and b. In the case of triplets, the condition of independence is more difficult to identify. Our proposal is contained in the following considerations.

The relationships written in the proof of Theorem 4 can also be formulated as:

P(a, b, c) = x
P(a, b, ¬c) = P(a, b) − x
P(a, ¬b, c) = P(a, c) − x
P(¬a, b, c) = P(b, c) − x
P(a, ¬b, ¬c) = P(a) − P(a, c) − P(a, b) + x
P(¬a, b, ¬c) = P(b) − P(a, b) − P(b, c) + x
P(¬a, ¬b, c) = P(c) − P(a, c) − P(b, c) + x
P(¬a, ¬b, ¬c) = 1 − P(a) − P(b) − P(c) + P(a, b) + P(a, c) + P(b, c) − x.

They express the values of all the third-order joint probabilities as functions of the known second-order probabilities P(a, b), P(a, c), ..., and the unknown third-order probability x = P(a, b, c). Now consider the entropy

E(x) = −Σ P(a*, b*, c*) · log P(a*, b*, c*) = −[x · log x + (P(a, b) − x) · log(P(a, b) − x) + ...].

This function of the unknown x is the average amount of information needed to know a, b, and c. The maximum value of E(x) is reached when a, b, and c are at the maximum level of independence compatible with the dependencies imposed by the second-order joint probabilities. This consideration explains the following Definition 8.

Definition 8 (Maximum Independence Estimate for Third-Order Probabilities). If the first- and second-order joint probabilities are known but no information is available on the third-order probabilities, the conjecture x on the value of P(a, b, c) maximizing the joint entropy of A, B, C,

E(x) = −Σ P(a*, b*, c*) · log P(a*, b*, c*)

(where the sum is to be extended to all the combinations of values of a, b, and c), is defined as the maximum independence estimate. Such maximum independence estimate is denoted with the symbol P(a, b, c)_MI. Analogously, for any combination a*, b*, c* of values of a, b, c, we shall denote by P(a*, b*, c*)_MI the value of P(a*, b*, c*) for which E(x) is maximum.

Notice that, by virtue of Theorem 4, for any combination of values a*, b*, c*, the estimate P(a*, b*, c*)_MI can be computed in terms of the second-order joint probabilities and P(a, b, c)_MI by applying the relationships

P(¬a, b, c)_MI = P(b, c) − P(a, b, c)_MI
P(a, ¬b, c)_MI = P(a, c) − P(a, b, c)_MI
P(a, b, ¬c)_MI = P(a, b) − P(a, b, c)_MI
P(a, ¬b, ¬c)_MI = P(a, ¬b) − P(a, ¬b, c)_MI
P(¬a, b, ¬c)_MI = P(¬a, b) − P(¬a, b, c)_MI
P(¬a, ¬b, c)_MI = P(¬a, c) − P(¬a, b, c)_MI
P(¬a, ¬b, ¬c)_MI = P(¬a, ¬b) − P(¬a, ¬b, c)_MI.

The meaning of Definition 8 is rather important for the model presented in this article. If D_2(A, B), D_2(A, C), D_2(B, C) are > 0 or < 0, then A, B, C are not independent, but they could own only the dependence inherited from the second-order dependencies, or their dependence might be stronger. In the former case, P(a*, b*, c*) is equal to P(a*, b*, c*)_MI and there is no real third-order dependence.
In the latter case, there is evidence of a third-order dependence whose value and sign depend on the difference between P(a*, b*, c*) and P(a*, b*, c*)_MI, as shown in the following analysis.
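The maximum independence estimate of Definition 8 can be approximated numerically. The sketch below is an assumption of this presentation (it is not the fast method the paper describes in Section 5): since E(x) is concave in x, its derivative decreases monotonically, so a simple bisection on dE/dx over the feasible interval finds the maximizing x.

```python
import math

def third_order_mi_estimate(pa, pb, pc, pab, pac, pbc, tol=1e-12):
    """P(a,b,c)_MI (Definition 8): the x maximizing the joint entropy E(x),
    found by bisection on dE/dx, which decreases since E is concave."""
    # The eight third-order joint probabilities as functions of x = P(a,b,c).
    def cells(x):
        return [x,
                pab - x, pac - x, pbc - x,
                pa - pab - pac + x, pb - pab - pbc + x, pc - pac - pbc + x,
                1 - pa - pb - pc + pab + pac + pbc - x]

    # dE/dx = -sum(s_i * (log cell_i + 1)), where s_i is the coefficient
    # (+1 or -1) with which x enters cell i.
    signs = [1, -1, -1, -1, 1, 1, 1, -1]
    def dE(x):
        return -sum(s * (math.log(c) + 1) for s, c in zip(signs, cells(x)))

    # Feasible interval: every cell probability must stay positive.
    lo = max(0.0, pab + pac - pa, pab + pbc - pb, pac + pbc - pc) + tol
    hi = min(pab, pac, pbc, 1 - pa - pb - pc + pab + pac + pbc) - tol
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if dE(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Sanity check: with pairwise-independent inputs the estimate
# reduces to P(a) * P(b) * P(c).
print(third_order_mi_estimate(0.5, 0.5, 0.5, 0.25, 0.25, 0.25))  # ~0.125
```

The sanity check mirrors the remark below that, for independent variables, maximum independence coincides with absolute independence.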

Notice that, in the case of pairs of items, Definition 8 of maximum independence coincides with the better known definitions of independence cited in Section 3.1. Indeed, in this case, as shown in Figure 1, the joint entropy of A and B is

E = −Σ P(a*, b*) · log P(a*, b*)
  = −[x · log x + (P(a) − x) · log(P(a) − x) + (P(b) − x) · log(P(b) − x) + (1 − P(a) − P(b) + x) · log(1 − P(a) − P(b) + x)],

where x = P(a, b). It is easy to prove that the function E has a maximum for

x = P(a, b)_MI = P(a) · P(b).

By applying the same algorithm, it is easy to prove the analogous results:

y = P(a, ¬b)_MI = P(a) · P(¬b)
z = P(¬a, b)_MI = P(¬a) · P(b)
w = P(¬a, ¬b)_MI = P(¬a) · P(¬b).

Unfortunately, in the case of triplets and k-plets the determination of the maximum independence estimates is not so simple. However, as shown, it is not necessary to know all the estimates P(i*_1, i*_2, ..., i*_k)_MI, as one of them is sufficient to determine all the others. In addition, the numerical evaluation of this estimate can be performed very quickly by applying the method described in Section 5.

The definition of the maximum independence estimate is applied in the following theorem, which can be viewed as a specification of Theorem 4 on the unicity of the value for third-order probabilities and as the natural extension of Theorem 2 proved in Section 3.1.

THEOREM 5. If the first- and second-order joint probabilities and the third-order maximum independence estimate are known, a single number defined as the difference

Δ = P(a, b, c) − P(a, b, c)_MI

is sufficient to specify all the third-order joint probabilities.

PROOF. Theorem 5 is a direct consequence of Theorem 4 on the unicity of the value. Indeed, from the knowledge of the first- and second-order joint probabilities we can obtain P(a, b, c)_MI, and from this P(a, b, c) = P(a, b, c)_MI + Δ. But, according to Theorem 4, the knowledge of P(a, b, c) is sufficient to determine all the third-order joint probabilities. □

By virtue of Theorem 5, we can state the following definitions, which are an extension of Definitions 5 and 6.

Definition 9 (Dependence Value of a Triplet). The dependence value of the triplet {A, B, C} is defined as the difference Δ = P(a, b, c) − P(a, b, c)_MI.

Definition 10 (Dependence State of a Triplet). If the absolute value of the dependence value of {A, B, C}, Δ = P(a, b, c) − P(a, b, c)_MI, exceeds a given threshold th, then A, B, and C are defined as connected by a third-order dependence. If Δ > th, the dependence is defined as positive; otherwise, it is defined as negative.

The following notations

D_3(A, B, C) > 0
D_3(A, B, C) < 0
D_3(A, B, C) = 0

are used to indicate the existence of a third-order dependence and its sign. Notice that in the model proposed by Silverstein et al. the existence of one or more second-order dependencies implies the existence of the third-order dependence, whereas in our model D_2(A, B), D_2(A, C), D_2(B, C), and D_3(A, B, C) are independent, in the sense that any combination of their values is possible. For example, even if all three second-order dependencies are positive, D_3(A, B, C) might be zero or negative. In Section 3.3, an example about the purchase of a triplet of items and the differences with respect to the other models are discussed.

The following Definition 11 on the dependence function of three variables and Theorem 6 extend the statements of Definition 7 on the dependence function for pairs and of Theorem 3 on the parity function to third-order dependencies.

Definition 11 (Dependence Function of Three Variables). The Boolean function of variables A, B, and C, whose minterms correspond to the values v_A, v_B, v_C for which

P(A = v_A, B = v_B, C = v_C) > P(A = v_A, B = v_B, C = v_C)_MI

is called the dependence function of variables A, B, and C.

THEOREM 6 (PARITY OF DEPENDENCE FUNCTIONS OF THREE VARIABLES). If D_3(A, B, C) > 0, the dependence function of variables A, B, and C is

A ⊕ B ⊕ C = A · B · C + A · ¬B · ¬C + ¬A · B · ¬C + ¬A · ¬B · C,

that is, the parity function with parity even (Figure 4). If D_3(A, B, C) < 0, the dependence function is

¬(A ⊕ B ⊕ C) = ¬A · ¬B · ¬C + ¬A · B · C + A · ¬B · C + A · B · ¬C,

that is, the parity function with parity odd and the complementary function of the preceding one (Figure 5).

Fig. 4. The dependence function when D_3(A, B, C) > 0.

PROOF. By definition,

P(a, b, ¬c) = P(a, b) − P(a, b, c) = P(a, b) − P(a, b, c)_MI − Δ = P(a, b, ¬c)_MI − Δ.

The values presented in Figure 6 follow from analogous computations. From these, it is immediate to derive the two maps of Figures 4 and 5, when D_3(A, B, C) > 0 or D_3(A, B, C) < 0, respectively. □

Justification of the Maximum Independence Definition. The idea of maximum independence introduced in this article is not intuitively obvious and needs some further justification. First consider the simple case of two variables A and B. In this case, as shown above, the definition of maximum independence coincides with the well-known definition of absolute independence, according to which A and B are independent if, and only if,

P(A = v_A, B = v_B) = P(A = v_A) · P(B = v_B)

for any combination of values of A and B. It is well known that the joint entropy of A and B is

E(A, B) = E(A) + E(B|A) = E(B) + E(A|B),

where E(B|A) and E(A|B) are the equivocation of B with respect to A and the equivocation of A with respect to B, respectively. Therefore, the maximum value of E(A, B) is reached when E(B|A) (or E(A|B)) is maximum. When A and B are independent, the amount of information needed to know B, if A is known, or to know A, if B is known, is maximum. Notice that in this case E(A|B) = E(A) and E(B|A) = E(B). These equalities will not hold in the case of three variables.

Now consider the case of three variables A, B, and C. In general, if the probabilities of A, B, and C and the second-order joint probabilities P(A, B), P(B, C), and P(A, C) have been assigned, there is no assignment of the probability P(A, B, C) for which A, B, and C are independent, that is, for which

P(A = v_A, B = v_B, C = v_C) = P(A = v_A) · P(B = v_B) · P(C = v_C)

for any combination of values of v_A, v_B, and v_C.
However, it makes sense to search for the value of P(A, B, C) for which the joint entropy E(A, B, C) is maximum, and to define that condition as the maximum level of independence compatible with the dependencies imposed by the second-order joint probabilities. Indeed,

Fig. 5. The dependence function when D_3(A, B, C) < 0.
Fig. 6. The dependence function for the three variables A, B, and C.

E(A, B, C) = E(A, B) + E(C|A, B) = E(A, C) + E(B|A, C) = E(B, C) + E(A|B, C).

Therefore, since E(A, B), E(A, C), and E(B, C) depend only on the values of the second-order probabilities, E(A, B, C) reaches its maximum for that assignment of P(A, B, C) for which E(C|A, B), E(B|A, C), and E(A|B, C) also reach their maximum values. In other words, the maximum independence level corresponds to the condition in which the maximum amount of information is needed to know the value of a variable, the other two being known. However, in general, since A, B, and C are not independent,

E(A|B, C) ≤ E(A|B) ≤ E(A) and E(A|B, C) ≤ E(A|C) ≤ E(A),

and this is different from the case of pairs of variables, for which the concepts of maximum independence and absolute independence coincide.

3.3 The Lattice of Dependencies

Since the knowledge of the dependence value of an itemset of cardinality k, together with the values of the joint probabilities of all its subsets of cardinality k−1, is sufficient to know the probabilities of all the combinations of its values, the lattice of the itemsets can be adopted to describe the whole system of dependencies of a given database. Of course, in such a lattice every node should be labeled with its associated dependence value. Besides, the nodes at the top of the lattice, representing the itemsets of cardinality 1, will be labeled with the values of the differences between the probability estimates, P(a) = O(a)/n, P(b) = O(b)/n, ..., and the corresponding starting estimates (typically, and in the absence of other estimates, equal to 0.5).

Fig. 7. The lattice relative to the purchases of coffee (c), tea (t), and doughnuts (d): nodes c, t, d, ct, cd, td, ctd.

By way of example, Figure 7 represents the dependence lattice relative to the sample reported by Silverstein et al. in their paper. The following are the data of the purchases of coffee (c), tea (t), and doughnuts (d) and their combinations proposed by those authors:

P(c, t, d) = 0.08
P(c, t, ¬d) = 0.01
P(c, ¬t, d) = 0.4
P(c, ¬t, ¬d) = 0.02
P(¬c, t, d) = 0.1
P(¬c, t, ¬d) = 0.02
P(¬c, ¬t, d) = 0.35
P(¬c, ¬t, ¬d) = 0.02.

The dependence values of the nodes of the lattice have been calculated as follows:

Δ(c) = P(c) − P(c)_MI = O(c)/n − 0.5
Δ(t) = P(t) − P(t)_MI = O(t)/n − 0.5
Δ(d) = P(d) − P(d)_MI = O(d)/n − 0.5
Δ(c, t) = P(c, t) − P(c, t)_MI = O(c, t)/n − P(c) · P(t)
Δ(c, d) = P(c, d) − P(c, d)_MI = O(c, d)/n − P(c) · P(d)
Δ(t, d) = P(t, d) − P(t, d)_MI = O(t, d)/n − P(t) · P(d)
Δ(c, t, d) = P(c, t, d) − P(c, t, d)_MI = O(c, t, d)/n − P(c, t, d)_MI.
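The node labels of the lattice can be reproduced numerically. The sketch below assumes a particular assignment of the presence/absence combinations to the eight probabilities listed above (that assignment is our illustration, so the exact numbers should be read as indicative); it recomputes the pairwise dependence values and the entropy-maximizing estimate P(c,t,d)_MI by bisection, as in Definition 8.

```python
import math

# Assumed joint distribution for coffee (C), tea (T), doughnuts (D);
# p[(vc, vt, vd)] with 1 = present, 0 = absent. Illustrative values.
p = {(1, 1, 1): 0.08, (1, 1, 0): 0.01, (1, 0, 1): 0.40, (1, 0, 0): 0.02,
     (0, 1, 1): 0.10, (0, 1, 0): 0.02, (0, 0, 1): 0.35, (0, 0, 0): 0.02}

def marg(positions):
    """P(all variables at the given positions are present)."""
    return sum(v for k, v in p.items() if all(k[i] == 1 for i in positions))

pc, pt, pd = marg([0]), marg([1]), marg([2])
pct, pcd, ptd = marg([0, 1]), marg([0, 2]), marg([1, 2])

# Second-order dependence values (Definition 5).
delta_ct = pct - pc * pt
delta_cd = pcd - pc * pd
delta_td = ptd - pt * pd

def mi_estimate(pa, pb, pc_, pab, pac, pbc, tol=1e-12):
    """P(a,b,c)_MI via bisection on the derivative of the joint entropy."""
    cells = lambda x: [x, pab - x, pac - x, pbc - x,
                       pa - pab - pac + x, pb - pab - pbc + x,
                       pc_ - pac - pbc + x,
                       1 - pa - pb - pc_ + pab + pac + pbc - x]
    signs = [1, -1, -1, -1, 1, 1, 1, -1]
    lo = max(0.0, pab + pac - pa, pab + pbc - pb, pac + pbc - pc_) + tol
    hi = min(pab, pac, pbc, 1 - pa - pb - pc_ + pab + pac + pbc) - tol
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if -sum(s * (math.log(c) + 1) for s, c in zip(signs, cells(mid))) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Third-order dependence value (Definition 9).
delta_ctd = p[(1, 1, 1)] - mi_estimate(pc, pt, pd, pct, pcd, ptd)
print(delta_ct, delta_cd, delta_td, delta_ctd)
```

With this assignment the marginals come out as P(c) = 0.51, P(t) = 0.21, P(d) = 0.93, and the feasible interval for P(c,t,d)_MI is (0.06, 0.09), within which the bisection converges.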

Fig. 8. The sign of the dependence function for the example of the purchases of coffee (variable C), tea (variable T), and doughnuts (variable D).

P(c, t, d)_MI has been computed by maximizing the entropy E(x) with x = P(c, t, d), as suggested in Definition 8 on the maximum independence estimate. Notice that from the value of Δ(c, t, d) and from Definition 10 on the state of dependencies it follows, for example, that the dependence of the itemset {c, t, d} is positive, whereas, by adopting the model proposed by Silverstein et al., the same dependence, evaluated as

P(a, b, c) − P(a) · P(b) · P(c),

would be negative. This is due to the fact that, in the Silverstein et al. model, the dependencies which the subset {c, t, d} has inherited from the subsets {c, t}, {c, d}, and {t, d} are not distinguished from its specific inherent dependence. The complete dependence table showing the dependence function signs for all the values of c, t, d is shown in Figure 8.

The dependence lattice can also be viewed as a useful tool to display the results of a data mining investigation on a given database. Of course, it will be convenient to display only the sublattice of the nodes having sufficient support and positive or negative dependencies that differ from zero in a significant way. Often, the dependence value itself is not necessary, it being sufficient to introduce the indication (+ or −) of the dependence state in the lattice produced.

3.4 Dependence Rules for k-plets of Items of Arbitrary Cardinality

The case of triplets discussed in Section 3.2 is absolutely general. However, for the sake of completeness, the definitions and theorems presented in Section 3.2 are extended here to the more general case of k-plets of arbitrary cardinality. For brevity, the proofs of the theorems are omitted, with the exception of Theorem 7, which needs a specific proof.
Consider the case of a k-plet of Boolean variables I_1, I_2, ..., I_k, and assume we know all the joint probabilities up to the order (k−1): P(i_1), P(i_2), ..., P(i_{k−1}), P(i_k); P(i_1, i_2), P(i_1, i_3), ..., P(i_{k−1}, i_k); ...; P(i_1, i_2, ..., i_{k−1}), ..., P(i_2, i_3, ..., i_{k−1}, i_k). We want to determine the kth-order joint probabilities, such as P(i_1, i_2, ..., i_{k−1}, i_k), P(i_1, i_2, ..., i_{k−1}, ¬i_k), and so on. The

following theorem shows that knowledge of a single kth-order joint probability is sufficient to determine all the kth-order probabilities.

THEOREM 7 (UNICITY OF THE VALUE). All the kth-order joint probabilities can be calculated as functions of the joint probabilities of the orders less than k and a single kth-order joint probability.

PROOF. Assume, for example, we know P(i_1, i_2, ..., i_{k−1}, i_k). First, we determine

P(¬i_1, i_2, ..., i_{k−1}, i_k) = P(i_2, ..., i_{k−1}, i_k) − P(i_1, i_2, ..., i_{k−1}, i_k).

Analogously, we determine all the other joint probabilities related to elementary conditions in which a single literal is complemented:

P(i_1, ¬i_2, i_3, ..., i_{k−1}, i_k) = P(i_1, i_3, ..., i_{k−1}, i_k) − P(i_1, i_2, i_3, ..., i_{k−1}, i_k),

and so on. Then, we compute all the joint probabilities referring to elementary conditions in which two literals appear complemented:

P(¬i_1, ¬i_2, i_3, ..., i_{k−1}, i_k) = P(¬i_1, i_3, ..., i_{k−1}, i_k) − P(¬i_1, i_2, i_3, ..., i_{k−1}, i_k),

and so on. In general, in order to determine all the joint probabilities related to elementary conditions containing m complemented literals, we apply the following relationship, in which a_1 < a_2 < ... < a_m, b_1 < b_2 < ... < b_{k−m}, 1 ≤ a_i ≤ k and 1 ≤ b_j ≤ k, with 1 ≤ i ≤ m and 1 ≤ j ≤ k−m:

P(¬i_{a_1}, ¬i_{a_2}, ..., ¬i_{a_m}, i_{b_1}, ..., i_{b_{k−m}}) = P(¬i_{a_1}, ¬i_{a_3}, ..., ¬i_{a_m}, i_{b_1}, ..., i_{b_{k−m}}) − P(¬i_{a_1}, i_{a_2}, ¬i_{a_3}, ..., ¬i_{a_m}, i_{b_1}, ..., i_{b_{k−m}}),

where at most (m−1) complemented literals appear on the right-hand side. □
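The recursion in the proof of Theorem 7 unfolds into an inclusion-exclusion sum over the complemented literals, which can be sketched as follows. The representation of events as dicts and frozensets is our implementation choice, not the paper's notation.

```python
from itertools import combinations

def assignment_probability(pattern, presence):
    """Probability of a full value assignment, given only 'presence'
    probabilities P(all items of S present) for every subset S.
    pattern: dict var -> True/False; presence: dict frozenset -> float.
    Repeatedly applying P(..., not i, ...) = P(... without i ...) - P(..., i, ...)
    yields an inclusion-exclusion sum over the complemented variables."""
    present = frozenset(v for v, val in pattern.items() if val)
    absent = [v for v, val in pattern.items() if not val]
    total = 0.0
    for r in range(len(absent) + 1):
        for t in combinations(absent, r):  # subsets T of complemented vars
            total += (-1) ** r * presence[present | frozenset(t)]
    return total

# Pair example from Section 2.1: P(C) = 0.9, P(T) = 0.25, P(C,T) = 0.2.
presence = {frozenset(): 1.0, frozenset({"C"}): 0.9,
            frozenset({"T"}): 0.25, frozenset({"C", "T"}): 0.2}
print(assignment_probability({"C": True, "T": False}, presence))   # ~0.7
print(assignment_probability({"C": False, "T": False}, presence))  # ~0.05
```

Note that the two printed values reproduce P(c, ¬t) = 0.7 and P(¬c, ¬t) = 0.05 of the coffee/tea example.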

Definition 12 (Maximum Independence Estimate). If the joint probabilities up to order k−1 are known but no information is available on the joint probabilities of order k, then the conjecture on P(i*_1, i*_2, ..., i*_k) maximizing the joint entropy of I_1, I_2, ..., I_k,

E = −Σ P(i*_1, i*_2, ..., i*_k) · log P(i*_1, i*_2, ..., i*_k),

is considered as the maximum independence estimate. For any i*_1, i*_2, ..., i*_k, the maximum independence estimate is indicated with the symbol P(i*_1, i*_2, ..., i*_k)_MI.

Definition 13 (Dependence Value). The difference Δ = P(i_1, i_2, ..., i_k) − P(i_1, i_2, ..., i_k)_MI is defined as the dependence value of the itemset {i_1, i_2, ..., i_k}.

THEOREM 8 (UNICITY OF THE VALUE). If the joint probabilities up to the order k−1 are known, the knowledge of the dependence value Δ = P(i_1, i_2, ..., i_k) − P(i_1, i_2, ..., i_k)_MI is sufficient to describe all the kth-order joint probabilities.

Definition 14 (Dependence State). If the absolute value of the dependence value of I_1, I_2, ..., I_k exceeds a given threshold th, then I_1, I_2, ..., I_k are defined as connected by a dependence of order k. If Δ > th, the dependence is defined as positive; otherwise, it is defined as negative.

The following notations

D_k(I_1, I_2, ..., I_k) > 0
D_k(I_1, I_2, ..., I_k) < 0
D_k(I_1, I_2, ..., I_k) = 0

are used to indicate the existence of a dependence of order k and its sign.

Definition 15 (Dependence Function). The Boolean function of variables I_1, I_2, ..., I_k, whose minterms correspond to the values v_1, v_2, ..., v_k for which

P(I_1 = v_1, I_2 = v_2, ..., I_k = v_k) > P(I_1 = v_1, I_2 = v_2, ..., I_k = v_k)_MI,

is called the dependence function of variables I_1, I_2, ..., I_k.
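Definitions 14 and 15 translate almost directly into code. The sketch below (the function names are ours) classifies the dependence state from the dependence value and enumerates the minterms of the dependence function, using the parity rule that Theorem 9 below makes precise: the cell difference is +Δ where the number of complemented literals is even and −Δ where it is odd.

```python
from itertools import product

def dependence_state(delta, th):
    """Return +1, -1, or 0 for D_k > 0, D_k < 0, D_k = 0 (Definition 14)."""
    if abs(delta) <= th:
        return 0
    return 1 if delta > 0 else -1

def dependence_function_minterms(k, delta):
    """Value assignments where P exceeds the maximum independence
    estimate (Definition 15): those with an even number of complemented
    (FALSE) literals when delta > 0, an odd number when delta < 0."""
    want_even = delta > 0
    return [values for values in product([True, False], repeat=k)
            if (values.count(False) % 2 == 0) == want_even]

print(dependence_function_minterms(2, 0.02))   # [(True, True), (False, False)]
print(dependence_function_minterms(2, -0.02))  # [(True, False), (False, True)]
```

For k = 2 this reproduces the two Karnaugh maps of Theorem 3: the even-parity function A·B + ¬A·¬B for a positive dependence, and A·¬B + ¬A·B for a negative one.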

THEOREM 9 (PARITY OF DEPENDENCE FUNCTIONS). If D_k(I_1, I_2, ..., I_k) > 0, the dependence function of variables I_1, I_2, ..., I_k is \overline{I_1 \oplus I_2 \oplus ... \oplus I_k}, that is, the parity function with even parity. If D_k(I_1, I_2, ..., I_k) < 0, the dependence function of variables I_1, I_2, ..., I_k is I_1 \oplus I_2 \oplus ... \oplus I_k, that is, the parity function with odd parity. In both cases, and for all values of I_1, I_2, ..., I_k, the difference P(i_1, i_2, ..., i_k) - P(i_1, i_2, ..., i_k)_{MI} has an absolute value equal to |\Delta|.

4. ENTROPY AND DEPENDENCIES

A less intuitive but in some respects more effective approach to determining dependencies can be based on the concept of entropy. In this section only a summary of a possible entropy-based theory of dependencies is presented, the task of developing such a theory following the scheme of Section 3 being left to the reader.

First consider the case of the pairs of items. Assume P(a), P(\bar{a}), P(b), P(\bar{b}) are known. The entropy of A,

E(A) = -P(a) \log P(a) - P(\bar{a}) \log P(\bar{a}),

is the measure of the average information content of the events a (A = TRUE) and \bar{a} (A = FALSE). An analogous meaning can be attributed to

E(B) = -P(b) \log P(b) - P(\bar{b}) \log P(\bar{b}).

Now consider the mutual information

I(A;B) = E(A) - E(A|B), where E(A|B) = - \sum_{a^*, b^*} P(a^*, b^*) \log P(a^*|b^*).

I(A;B) = E(A) - E(A|B) = E(B) - E(B|A) is a measure of the average information content carried by b^* on the value of A, and vice versa; therefore, it can be assumed as an indication of the dependence of A and B. Unfortunately, mutual information I(A;B) is always nonnegative, so it is necessary to verify whether P(a,b) > P(a) P(b) in order to determine the sign of the dependence. Therefore, we propose the following definition:
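Theorem 9 can be checked directly for pairs: writing P(a,b) = P(a)P(b) + Δ and deriving the remaining cells by marginalization shows a deviation of the same magnitude |Δ| in every cell, positive exactly on the even-parity cells. A small Python sketch (hypothetical names):

```python
def pair_cells(p_a, p_b, delta):
    """All four joint probabilities of a pair once P(a,b) = P(a)P(b) + delta.
    The deviation from the independence product has magnitude |delta| in every
    cell, with sign + on the even-parity cells (1,1) and (0,0) and sign - on
    the odd-parity cells (1,0) and (0,1), as Theorem 9 states for k = 2."""
    p_ab = p_a * p_b + delta
    return {
        (1, 1): p_ab,                       # a and b both present
        (1, 0): p_a - p_ab,                 # a present, b absent
        (0, 1): p_b - p_ab,                 # a absent, b present
        (0, 0): 1.0 - p_a - p_b + p_ab,     # both absent
    }
```

With P(a) = 0.5, P(b) = 0.4, and Δ = 0.1, the deviations from independence are +0.1 on cells (1,1) and (0,0) and -0.1 on cells (1,0) and (0,1).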

Definition 16 (Entropy-Based Second-Order Dependence). If I(A;B) exceeds a given threshold, we state that D_2(A, B) > 0 or D_2(A, B) < 0 according to whether P(a,b) > P(a) P(b) or not. If I(A;B) does not exceed that threshold, we state that D_2(A, B) = 0.

The extension of such a definition to triplets is not immediate, since the ternary mutual information I(A;B;C) defined in information theory does not have the meaning we need now. Therefore we suggest the following.

Definition 17 (Entropy-Based Third-Order Dependence). If both E(A|B) - E(A|B,C) and E(A|C) - E(A|B,C) exceed a given threshold, we state that D_3(A, B, C) > 0 when P(a,b,c) is larger than all the following estimates:

P(a) P(b,c), P(b) P(a,c), P(c) P(a,b),

or D_3(A, B, C) < 0 when P(a,b,c) is less than all three preceding estimates; otherwise, D_3(A, B, C) = 0.

Definition 17 might seem asymmetric with respect to variables A, B, and C. In order to understand the reasons for which the relationships written in Definition 17 are symmetric, remember that, for example, if E(A|B) - E(A|B,C) >= th, then also E(C|B) - E(C|A,B) >= th. In addition, E(A|B,C) <= E(A|B) and E(A|B,C) <= E(A|C). When, for example, E(A|B) = E(A|B,C), E(A|B,C) is maximum and, therefore, E(A,B,C) = E(B,C) + E(A|B,C) is also maximum (maximum independence condition). It follows that E(B|A,C) and E(C|A,B) also take their maximum values, equal to E(B|A) or E(B|C) and to E(C|A) or E(C|B), respectively.

An analogous definition can be introduced to evaluate dependencies in k-plets with arbitrary k.

Definition 18 (Entropy-Based General Dependence). Let E^k_{MIN} be the minimum of all the conditional entropies

E(A|C, D, ..., Z), E(A|B, D, ..., Z), ...,

where {C, D, ..., Z}, {B, D, ..., Z}, ... are the subsets of (k-2) variables of the set {B, C, D, ..., Z} of (k-1) variables.
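Definition 16 can be computed directly for a pair of binary variables from P(a), P(b), and P(a,b). A sketch with hypothetical names, using natural logarithms:

```python
import math

def mutual_info_pair(p_a, p_b, p_ab):
    """I(A;B) as the sum over the four cells of P(x,y) log(P(x,y)/(P(x)P(y))),
    which equals E(A) - E(A|B). Cells with zero probability contribute 0."""
    cells = [
        (p_ab, p_a * p_b),
        (p_a - p_ab, p_a * (1 - p_b)),
        (p_b - p_ab, (1 - p_a) * p_b),
        (1 - p_a - p_b + p_ab, (1 - p_a) * (1 - p_b)),
    ]
    return sum(p * math.log(p / q) for p, q in cells if p > 0)

def entropy_based_d2(p_a, p_b, p_ab, th):
    """Definition 16: if I(A;B) exceeds th, the sign of the dependence is
    decided by comparing P(a,b) with P(a)P(b); otherwise D_2 = 0."""
    if mutual_info_pair(p_a, p_b, p_ab) <= th:
        return 0
    return 1 if p_ab > p_a * p_b else -1
```

Under independence (P(a,b) = P(a)P(b)) the mutual information is zero, so the state is 0 for any positive threshold; pulling P(a,b) above or below the product flips the state to +1 or -1.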

If E^k_{MIN} - E(A|B, C, D, ..., Z) (where {B, C, D, ..., Z} is the complete set of the (k-1) remaining variables) exceeds a given threshold, and P(a, b, c, ..., z) is larger than the maximum of the estimates

P(a) P(b, c, ..., z), P(b) P(a, c, ..., z), P(c) P(a, b, ..., z), ...,

we state that D_k(A, B, C, ..., Z) > 0. If the same difference E^k_{MIN} - E(A|B, C, D, ..., Z) exceeds the given threshold and P(a, b, c, ..., z) is smaller than the minimum of the estimates

P(a) P(b, c, ..., z), P(b) P(a, c, ..., z), P(c) P(a, b, ..., z), ...,

we state that D_k(A, B, C, ..., Z) < 0. If neither of the two above specified conditions holds, we state that D_k(A, B, C, ..., Z) = 0.

Notice that the computation of entropies can be simplified by applying the following theorem.

THEOREM 10. If the entropies of order k-1 are known, a single entropy needs to be determined in order to calculate all the entropies of order k.

PROOF. Assume we know E(I_1 | I_2, I_3, ..., I_k). First we determine E(I_2 | I_1, I_3, ..., I_k) by observing that

I(I_1; I_2 | I_3, ..., I_k) = E(I_1 | I_3, I_4, ..., I_k) - E(I_1 | I_2, I_3, ..., I_k) = E(I_2 | I_3, I_4, ..., I_k) - E(I_2 | I_1, I_3, ..., I_k),

where only the last entropy is unknown. A similar method can be applied to determine all the other conditional entropies of the type

E(I_j | I_1, I_2, ..., I_{j-1}, I_{j+1}, ..., I_k) (with 2 <= j <= k). Finally, any other entropy can be easily calculated in terms of those already calculated. For example,

E(I_1, I_2 | I_3, I_4, ..., I_k) = E(I_2 | I_3, I_4, ..., I_k) + E(I_1 | I_2, I_3, ..., I_k). ∎

5. DETERMINING DEPENDENCE VALUES

The analysis developed in this article essentially refers to the concept of confidence and not to the principle of support. Almost all the algorithms for data mining proposed thus far are based on a first step that determines which k-plets have sufficient support, that is, sufficient statistical relevance. Such solutions are compatible with the following algorithm for determining all the relevant dependencies up to a certain order.

(1) Determining the k-plets that have sufficient support. Most algorithms for determining the k-plets that have sufficient support proceed in order of increasing cardinalities. In other words, they first determine the single items, then the pairs of items, the triplets, and so on. Such algorithms are well suited to the following procedure. The algorithms should be modified in order to examine a k-plet P after the (k-1)-plets contained in P. The program, which was developed specifically to verify the ideas described in this article, is based on an algorithm for determining the k-plets (also called itemsets) that have sufficient support [Meo 1999]. It was chosen for its speed, but it produces the list of itemsets organized in a family of trees. However, this data structure, like any other, can be transformed into a lattice suitable for the computations described above in a relatively short time. Note that it is not necessary that the complete lattice of all the itemsets with sufficient support be represented in the main memory at the same time.
What is really needed is that, at the starting point, every node of the structure, i.e., every analyzed itemset, is represented by two sets of data: (a) the values of the joint probabilities describing that itemset; (b) the pointers to its parents. For simplicity, as concerns point (a) above, the program described here describes an itemset with a single datum (by virtue of Theorem 7 on the unicity of the value). The chosen datum is the number of occurrences n_{ab...z} of that itemset, which is proportional to its probability P(a, b, ..., z) (see Figure 9, where ptr_x denotes the pointer to itemset x).
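The node layout just described (one occurrence count plus pointers to the (k-1)-subsets) can be sketched as follows; the class and helper names are hypothetical, not the paper's actual implementation:

```python
class ItemsetNode:
    """One node of the itemset lattice: the occurrence count of the itemset
    plus pointers to the nodes of its (k-1)-subsets, as in Figure 9."""
    __slots__ = ("items", "count", "parents")

    def __init__(self, items, count, parents=()):
        self.items = frozenset(items)
        self.count = count            # n_{ab...z}, proportional to P(a,b,...,z)
        self.parents = list(parents)  # nodes of the (k-1)-subsets

def build_abc_lattice(counts):
    """Build a small lattice like the one of Figure 9 from a dict such as
    {'a': n_a, ..., 'abc': n_abc} (keys are strings of item names).
    Processing keys in order of length guarantees parents exist first."""
    nodes = {}
    for key in sorted(counts, key=len):
        parents = [nodes[k] for k in nodes
                   if len(k) == len(key) - 1 and set(k) <= set(key)]
        nodes[key] = ItemsetNode(key, counts[key], parents)
    return nodes
```

For the lattice of Figure 9, the node for {a,b,c} ends up with three parent pointers (to {a,b}, {a,c}, {b,c}), each pair node with two, and each single item with none.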

Fig. 9. Data structure of the itemsets: each node of the lattice over {a, b, c} stores the occurrence count of its itemset (n_a, n_b, n_c, n_ab, n_ac, n_bc, n_abc) and the pointers (ptr_a, ptr_b, ..., ptr_bc) to the nodes of its subsets of cardinality one less.

This choice makes it possible to store millions of itemsets in the main memory at the same time and to perform all the following computations without storing any partial results in the mass memory.

(2) Determining all the joint probabilities of an itemset. The computation of the joint probabilities P(i^*_1, i^*_2, ..., i^*_k) for all the combinations of values of i^*_1, i^*_2, ..., i^*_k can be performed recursively by applying the relationships presented in the proof of Theorem 7 on the unicity of the value. Of course, recursion proceeds towards the parents and the grandparents. For example, in the case of Figure 9,

P(a, \bar{b}, \bar{c}) = P(a, \bar{c}) - P(a, b, \bar{c}) = P(a) - P(a,c) - P(a,b) + P(a,b,c),

where only the probabilities directly connected to the numbers of occurrences introduced in Figure 9 appear.

(3) Determining maximum independence estimates. The determination of the value x for which the joint entropy

E(x) = - \sum P(i^*_1, i^*_2, ..., i^*_k) \log P(i^*_1, i^*_2, ..., i^*_k)

takes its maximum value can be performed numerically, at the desired level of accuracy, with conventional interpolation techniques.

(4) Computing the dependence values and states. A direct application of Definitions 13 and 14 leads to the final results.
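Step (3) can be sketched for a triple: with the marginals and pairwise probabilities fixed, all eight cell probabilities are linear in x = P(a,b,c), the joint entropy is concave in x, and a simple ternary search over the feasible interval locates the maximum. A sketch under these assumptions (hypothetical names; the paper's program uses conventional interpolation instead):

```python
import math

def mi_estimate_triple(pa, pb, pc, pab, pac, pbc, tol=1e-12):
    """Maximum independence estimate of P(a,b,c): the value x that maximizes
    the joint entropy of the eight cells. Since the entropy is concave in x,
    a ternary search on the feasible interval converges to the maximum."""
    def cells(x):
        # the eight joint probabilities, by inclusion-exclusion (Theorem 7)
        return [x, pab - x, pac - x, pbc - x,
                pa - pab - pac + x, pb - pab - pbc + x, pc - pac - pbc + x,
                1 - pa - pb - pc + pab + pac + pbc - x]

    def entropy(x):
        return -sum(p * math.log(p) for p in cells(x) if p > 0)

    # feasible interval: every cell probability must stay non-negative
    lo = max(0.0, pab + pac - pa, pab + pbc - pb, pac + pbc - pc)
    hi = min(pab, pac, pbc)
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if entropy(m1) < entropy(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2
```

For independent marginals (P(a) = P(b) = P(c) = 0.5, all pairwise probabilities 0.25) the estimate is 0.125, the full independence product, as expected.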

Table I. Number of Itemsets and Their Average Lengths in the Experiments (columns: total itemset number, nominal average length, actual average length; the numeric entries are not reproduced here).

Experimental Evaluation

The proposed approach has been verified with an implementation in C++ using the Standard Template Library. The program has been run on a PC Pentium II, with a 233 MHz clock, 128 MB RAM, and Red Hat Linux as the operating system. We worked on a class of databases taken as a benchmark for association rules by most data mining algorithms: the class of synthetic databases that the Quest project of IBM generated for its experiments (see Agrawal and Srikant [1994] for details). We experimented many times on several databases with different values for the main parameters and for minimum support, but the results were all similar to the ones presented here. In particular, the experiments were run with the value for minimum support equal to 0.2% and with a fixed precision in the computation of the dependence values. In generating the databases, we adopted the same parameter settings proposed for synthetic databases: D, the number of transactions in the database, fixed to 100,000; N, the total number of items, fixed to 1000; T, the average transaction length, fixed to 10, since its value does not influence program behavior. On the contrary, I, the average length of the frequent itemsets, was varied, since its value really determines the depth of the lattice to be generated. Each database contains itemsets with sufficient support, with different average lengths (3, 4, ..., 8). The extreme values of the interval [2, 10] were discarded for the following reasons. The low value was discarded because it does not make much sense to maximize the entropy related to itemsets having only two items. The direct approach, which compares the probability of such an itemset with the product of the probabilities of the two items, was adopted in this case.
High values were avoided because the longer the itemsets, the higher the probability that they fall under the threshold for minimum support. In this case, too few itemsets turn out to be over the threshold, and comparing the different experiments is not fair. Hence, even if the nominal average itemset length is increased, the actual average length of the itemsets with sufficient support turns out to be significantly lower. Table I reports the total number of itemsets with sufficient support together with their nominal and actual average lengths. In Figure 10, two execution times are shown: T_1 is the CPU execution time needed for identifying all the itemsets with sufficient support; T_2 is

the time for computing the dependence values of the itemsets identified previously. Both times have been normalized with respect to the total number of itemsets, since this number changes considerably in the various experiments.

Fig. 10. Experimental results (CPU time [s] per itemset versus average itemset length, for curves T_1 and T_2).

Notice that time T_1 decreases with the actual average length of the itemsets. This is a peculiarity of the algorithm (called Seq) adopted for the first step, because during its execution it builds temporary data structures that do not depend on the length of the itemset. Furthermore, Seq proved suitable for very large databases and for searches characterized by very low resolution values. Time T_2, on the contrary, increases with the average length of the itemsets, because the depth of the resulting lattice increases. This fact is not surprising. On the other hand, note that the increments are moderate, with the exception of the experiment with the average itemset length equal to 4.82 (corresponding to a nominal average itemset length equal to 6). In that experiment, as Table I shows, the total number of itemsets that exceed the minimum support threshold grows suddenly with respect to the itemset length. The lattice generated in these conditions is very heavily populated. Hence, the high dimension of the output explains the result. Finally, these experiments demonstrate that this new approach to knowledge discovery of itemset dependencies is feasible and suitable for the high-resolution searches typical of data mining.

CONCLUSIONS

We show in this article that a single real number, the dependence value, contains all the information on the dependencies relative to a given itemset. In addition, by virtue of the theorem stating that dependence functions are always parity functions, the determination of the combinations of values for which the dependencies are positive or negative is immediate. In addition, for practical cases, the feasibility of this new theory is demonstrated in a set of experiments on various databases.

Some themes are worth developing further. The first concerns the maximum independence estimates. Is it possible to find a closed formula giving the maximum independence probability of a given itemset as a function of lower-level probabilities? The second theme is defining confidence levels. Which percentage of the probability P(a,b,...) must be exceeded by the dependence value in order to state that the dependence is strong? This question has not been discussed adequately in the support-confidence framework, but the model introduced here is probably more suitable for sound theoretical analysis. A third area for investigation concerns the algorithms for determining the dependence values. The method proposed in this article assumes that the itemsets with sufficient support have been determined by adopting one of the known methods, and then performs the computation of the dependence values on them. However, it is likely that an integrated method combining the two steps would be more rapid and effective. Theoretical analysis based on probability and information theory, as well as the development of new algorithms, should be combined and integrated in this research.

REFERENCES

AGGARWAL, C. C. AND YU, P. S. 1998. Online generation of association rules. In Proceedings of the 14th International Conference on Data Engineering (Orlando, FL). IEEE Computer Society Press, Los Alamitos, CA.
AGRAWAL, R., MANNILA, H., SRIKANT, R., TOIVONEN, H., AND VERKAMO, A. I. 1996. Fast discovery of association rules.
In Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. AAAI Press, Menlo Park, CA.
AGRAWAL, R. AND SRIKANT, R. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94, Santiago, Chile, Sept.). VLDB Endowment, Berkeley, CA.
AGRAWAL, R., IMIELINSKI, T., AND SWAMI, A. 1993. Database mining: A performance perspective. IEEE Trans. Knowl. Data Eng. 5, 6 (Dec.).
AGRAWAL, R., IMIELINSKI, T., AND SWAMI, A. 1993. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD '93, Washington, DC, May 26-28), P. Buneman and S. Jajodia, Eds. ACM Press, New York, NY.
BRIN, S., MOTWANI, R., ULLMAN, J. D., AND TSUR, S. 1997. Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '97, Tucson, AZ, May 13-15), J. M. Peckman, S. Ram, and M. Franklin, Eds. ACM Press, New York, NY.
HAN, J. AND FU, Y. 1995. Discovery of multiple-level association rules from large databases. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95, Zurich, Sept.).

HAN, J., CAI, Y., AND CERCONE, N. 1992. Knowledge discovery in databases: An attribute-oriented approach. In Proceedings of the 18th International Conference on Very Large Data Bases (Vancouver, B.C., Aug.). VLDB Endowment, Berkeley, CA.
HOUTSMA, M. A. W. AND SWAMI, A. 1995. Set-oriented mining for association rules in relational databases. In Proceedings of the IEEE 11th International Conference on Data Engineering (Taipei, Taiwan, Mar. 6-10). IEEE Press, Piscataway, NJ.
IMIELINSKI, T. 1996. From file mining to database mining. In Proceedings of the ACM SIGMOD International Workshop on Data Mining and Knowledge Discovery (SIGMOD '96, Aug.), R. Ng, Ed. ACM Press, New York, NY.
ELDER IV, J. F. AND PREGIBON, D. A statistical perspective on KDD. Tech. Rep.
THOMAS, G., KAWAGOE, K., KRISHNAMURTHY, R., IMIELINSKI, T., REINER, D., AND WOLSKI, A. 1991. Practitioner problems in need of database research. SIGMOD Rec. 20, 3 (Sept.).
LIN, D. AND KEDEM, Z. 1998. Pincer-search: A new algorithm for discovering the maximum frequent set. In Proceedings of the 6th International Conference on Extending Database Technology (EDBT '98, Valencia, Spain, Mar.).
MEO, R. 1999. A new approach for the discovery of frequent itemsets. In Proceedings of the First International Conference on Data Warehousing and Knowledge Discovery (DaWaK '99, Florence, Aug./Sept.).
PARK, J. S., CHEN, M.-S., AND YU, P. S. 1995. An effective hash based algorithm for mining association rules. In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD '95, San Jose, CA, May). ACM Press, New York, NY.
SAVASERE, A., OMIECINSKI, E., AND NAVATHE, S. 1995. An efficient algorithm for mining association rules in large databases. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95, Zurich, Sept.).
SILVERSTEIN, C., BRIN, S., AND MOTWANI, R. 1998. Beyond market baskets: Generalizing association rules to dependence rules. Data Mining Knowl. Discovery 2, 1.
SRIKANT, R. AND AGRAWAL, R. Mining quantitative association rules in large relational tables.
In Proceedings of the ACM SIGMOD International Conference on Management of Data (San Jose, CA, May). ACM Press, New York, NY.
SRIKANT, R. AND AGRAWAL, R. 1995. Mining generalized association rules. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95, Zurich, Sept.).
TOIVONEN, H. 1996. Sampling large databases for association rules. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB '96, Bombay, Sept.).

Received: October 1998; revised: September 1999; accepted: January 2000


More information

Data Mining and Analysis: Fundamental Concepts and Algorithms

Data Mining and Analysis: Fundamental Concepts and Algorithms Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA

More information

DATA MINING - 1DL360

DATA MINING - 1DL360 DATA MINING - 1DL36 Fall 212" An introductory class in data mining http://www.it.uu.se/edu/course/homepage/infoutv/ht12 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology, Uppsala

More information

Frequent Itemset Mining

Frequent Itemset Mining ì 1 Frequent Itemset Mining Nadjib LAZAAR LIRMM- UM COCONUT Team (PART I) IMAGINA 17/18 Webpage: http://www.lirmm.fr/~lazaar/teaching.html Email: lazaar@lirmm.fr 2 Data Mining ì Data Mining (DM) or Knowledge

More information

Mining Temporal Patterns for Interval-Based and Point-Based Events

Mining Temporal Patterns for Interval-Based and Point-Based Events International Journal of Computational Engineering Research Vol, 03 Issue, 4 Mining Temporal Patterns for Interval-Based and Point-Based Events 1, S.Kalaivani, 2, M.Gomathi, 3, R.Sethukkarasi 1,2,3, Department

More information

Association Rule Mining on Web

Association Rule Mining on Web Association Rule Mining on Web What Is Association Rule Mining? Association rule mining: Finding interesting relationships among items (or objects, events) in a given data set. Example: Basket data analysis

More information

Maintaining Frequent Itemsets over High-Speed Data Streams

Maintaining Frequent Itemsets over High-Speed Data Streams Maintaining Frequent Itemsets over High-Speed Data Streams James Cheng, Yiping Ke, and Wilfred Ng Department of Computer Science Hong Kong University of Science and Technology Clear Water Bay, Kowloon,

More information

Alternative Approach to Mining Association Rules

Alternative Approach to Mining Association Rules Alternative Approach to Mining Association Rules Jan Rauch 1, Milan Šimůnek 1 2 1 Faculty of Informatics and Statistics, University of Economics Prague, Czech Republic 2 Institute of Computer Sciences,

More information

DATA MINING - 1DL360

DATA MINING - 1DL360 DATA MINING - DL360 Fall 200 An introductory class in data mining http://www.it.uu.se/edu/course/homepage/infoutv/ht0 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology, Uppsala

More information

732A61/TDDD41 Data Mining - Clustering and Association Analysis

732A61/TDDD41 Data Mining - Clustering and Association Analysis 732A61/TDDD41 Data Mining - Clustering and Association Analysis Lecture 6: Association Analysis I Jose M. Peña IDA, Linköping University, Sweden 1/14 Outline Content Association Rules Frequent Itemsets

More information

NetBox: A Probabilistic Method for Analyzing Market Basket Data

NetBox: A Probabilistic Method for Analyzing Market Basket Data NetBox: A Probabilistic Method for Analyzing Market Basket Data José Miguel Hernández-Lobato joint work with Zoubin Gharhamani Department of Engineering, Cambridge University October 22, 2012 J. M. Hernández-Lobato

More information

Discovery of Functional and Approximate Functional Dependencies in Relational Databases

Discovery of Functional and Approximate Functional Dependencies in Relational Databases JOURNAL OF APPLIED MATHEMATICS AND DECISION SCIENCES, 7(1), 49 59 Copyright c 2003, Lawrence Erlbaum Associates, Inc. Discovery of Functional and Approximate Functional Dependencies in Relational Databases

More information

Machine Learning: Pattern Mining

Machine Learning: Pattern Mining Machine Learning: Pattern Mining Information Systems and Machine Learning Lab (ISMLL) University of Hildesheim Wintersemester 2007 / 2008 Pattern Mining Overview Itemsets Task Naive Algorithm Apriori Algorithm

More information

Guaranteeing the Accuracy of Association Rules by Statistical Significance

Guaranteeing the Accuracy of Association Rules by Statistical Significance Guaranteeing the Accuracy of Association Rules by Statistical Significance W. Hämäläinen Department of Computer Science, University of Helsinki, Finland Abstract. Association rules are a popular knowledge

More information

ASSOCIATION ANALYSIS FREQUENT ITEMSETS MINING. Alexandre Termier, LIG

ASSOCIATION ANALYSIS FREQUENT ITEMSETS MINING. Alexandre Termier, LIG ASSOCIATION ANALYSIS FREQUENT ITEMSETS MINING, LIG M2 SIF DMV course 207/208 Market basket analysis Analyse supermarket s transaction data Transaction = «market basket» of a customer Find which items are

More information

An Approach to Classification Based on Fuzzy Association Rules

An Approach to Classification Based on Fuzzy Association Rules An Approach to Classification Based on Fuzzy Association Rules Zuoliang Chen, Guoqing Chen School of Economics and Management, Tsinghua University, Beijing 100084, P. R. China Abstract Classification based

More information

FP-growth and PrefixSpan

FP-growth and PrefixSpan FP-growth and PrefixSpan n Challenges of Frequent Pattern Mining n Improving Apriori n Fp-growth n Fp-tree n Mining frequent patterns with FP-tree n PrefixSpan Challenges of Frequent Pattern Mining n Challenges

More information

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #11: Frequent Itemsets

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #11: Frequent Itemsets CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #11: Frequent Itemsets Refer Chapter 6. MMDS book. VT CS 5614 2 Associa=on Rule Discovery Supermarket shelf management Market-basket model:

More information

A Clear View on Quality Measures for Fuzzy Association Rules

A Clear View on Quality Measures for Fuzzy Association Rules A Clear View on Quality Measures for Fuzzy Association Rules Martine De Cock, Chris Cornelis, and Etienne E. Kerre Fuzziness and Uncertainty Modelling Research Unit Department of Applied Mathematics and

More information

CS 412 Intro. to Data Mining

CS 412 Intro. to Data Mining CS 412 Intro. to Data Mining Chapter 6. Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods Jiawei Han, Computer Science, Univ. Illinois at Urbana -Champaign, 2017 1 2 3

More information

Constraint-Based Rule Mining in Large, Dense Databases

Constraint-Based Rule Mining in Large, Dense Databases Appears in Proc of the 5th Int l Conf on Data Engineering, 88-97, 999 Constraint-Based Rule Mining in Large, Dense Databases Roberto J Bayardo Jr IBM Almaden Research Center bayardo@alummitedu Rakesh Agrawal

More information

Association Rules Information Retrieval and Data Mining. Prof. Matteo Matteucci

Association Rules Information Retrieval and Data Mining. Prof. Matteo Matteucci Association Rules Information Retrieval and Data Mining Prof. Matteo Matteucci Learning Unsupervised Rules!?! 2 Market-Basket Transactions 3 Bread Peanuts Milk Fruit Jam Bread Jam Soda Chips Milk Fruit

More information

CS 484 Data Mining. Association Rule Mining 2

CS 484 Data Mining. Association Rule Mining 2 CS 484 Data Mining Association Rule Mining 2 Review: Reducing Number of Candidates Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent Apriori principle holds due

More information

A Logical Formulation of the Granular Data Model

A Logical Formulation of the Granular Data Model 2008 IEEE International Conference on Data Mining Workshops A Logical Formulation of the Granular Data Model Tuan-Fang Fan Department of Computer Science and Information Engineering National Penghu University

More information

Finding Association Rules that Trade Support Optimally Against Confidence

Finding Association Rules that Trade Support Optimally Against Confidence Finding Association Rules that Trade Support Optimally Against Confidence Tobias Scheffer Humboldt-Universität zu Berlin, Department of Computer Science Unter den Linden 6, 199 Berlin, Germany scheffer@informatik.hu-berlin.de

More information

Density-Based Clustering

Density-Based Clustering Density-Based Clustering idea: Clusters are dense regions in feature space F. density: objects volume ε here: volume: ε-neighborhood for object o w.r.t. distance measure dist(x,y) dense region: ε-neighborhood

More information

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05 Meelis Kull meelis.kull@ut.ee Autumn 2017 1 Sample vs population Example task with red and black cards Statistical terminology Permutation test and hypergeometric test Histogram on a sample vs population

More information

Discovery of Frequent Word Sequences in Text. Helena Ahonen-Myka. University of Helsinki

Discovery of Frequent Word Sequences in Text. Helena Ahonen-Myka. University of Helsinki Discovery of Frequent Word Sequences in Text Helena Ahonen-Myka University of Helsinki Department of Computer Science P.O. Box 26 (Teollisuuskatu 23) FIN{00014 University of Helsinki, Finland, helena.ahonen-myka@cs.helsinki.fi

More information

2002 Journal of Software

2002 Journal of Software 1-9825/22/13(3)41-7 22 Journal of Software Vol13, No3,,,, (, 2326) E-mail inli@ustceducn http//wwwustceducn,,,,, Agrawal,,, ; ; ; TP18 A,,,,, ( ),,, ; Agrawal [1], [2],, 1 Agrawal [1], [1],Agrawal,, Heikki

More information

Data mining, 4 cu Lecture 5:

Data mining, 4 cu Lecture 5: 582364 Data mining, 4 cu Lecture 5: Evaluation of Association Patterns Spring 2010 Lecturer: Juho Rousu Teaching assistant: Taru Itäpelto Evaluation of Association Patterns Association rule algorithms

More information

EFFICIENT MINING OF WEIGHTED QUANTITATIVE ASSOCIATION RULES AND CHARACTERIZATION OF FREQUENT ITEMSETS

EFFICIENT MINING OF WEIGHTED QUANTITATIVE ASSOCIATION RULES AND CHARACTERIZATION OF FREQUENT ITEMSETS EFFICIENT MINING OF WEIGHTED QUANTITATIVE ASSOCIATION RULES AND CHARACTERIZATION OF FREQUENT ITEMSETS Arumugam G Senior Professor and Head, Department of Computer Science Madurai Kamaraj University Madurai,

More information

NEGATED ITEMSETS OBTAINING METHODS FROM TREE- STRUCTURED STREAM DATA

NEGATED ITEMSETS OBTAINING METHODS FROM TREE- STRUCTURED STREAM DATA NEGATED ITEMSETS OBTAINING METHODS FROM TREE- STRUCTURED STREAM DATA JURYON PAIK Pyeongtaek University, Department of Digital Information and Statistics, Gyeonggi-do 17869, South Korea E-mail: jrpaik@ptu.ac.kr

More information

InfoMiner: Mining Surprising Periodic Patterns

InfoMiner: Mining Surprising Periodic Patterns InfoMiner: Mining Surprising Periodic Patterns Jiong Yang IBM Watson Research Center jiyang@us.ibm.com Wei Wang IBM Watson Research Center ww1@us.ibm.com Philip S. Yu IBM Watson Research Center psyu@us.ibm.com

More information

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany Syllabus Fri. 21.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 27.10. (2) A.1 Linear Regression Fri. 3.11. (3) A.2 Linear Classification Fri. 10.11. (4) A.3 Regularization

More information

Mining Rank Data. Sascha Henzgen and Eyke Hüllermeier. Department of Computer Science University of Paderborn, Germany

Mining Rank Data. Sascha Henzgen and Eyke Hüllermeier. Department of Computer Science University of Paderborn, Germany Mining Rank Data Sascha Henzgen and Eyke Hüllermeier Department of Computer Science University of Paderborn, Germany {sascha.henzgen,eyke}@upb.de Abstract. This paper addresses the problem of mining rank

More information

Mining Free Itemsets under Constraints

Mining Free Itemsets under Constraints Mining Free Itemsets under Constraints Jean-François Boulicaut Baptiste Jeudy Institut National des Sciences Appliquées de Lyon Laboratoire d Ingénierie des Systèmes d Information Bâtiment 501 F-69621

More information

Apriori algorithm. Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK. Presentation Lauri Lahti

Apriori algorithm. Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK. Presentation Lauri Lahti Apriori algorithm Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK Presentation 12.3.2008 Lauri Lahti Association rules Techniques for data mining and knowledge discovery in databases

More information

Association Analysis: Basic Concepts. and Algorithms. Lecture Notes for Chapter 6. Introduction to Data Mining

Association Analysis: Basic Concepts. and Algorithms. Lecture Notes for Chapter 6. Introduction to Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Association

More information

Lecture Notes for Chapter 6. Introduction to Data Mining. (modified by Predrag Radivojac, 2017)

Lecture Notes for Chapter 6. Introduction to Data Mining. (modified by Predrag Radivojac, 2017) Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar (modified by Predrag Radivojac, 27) Association Rule Mining Given a set of transactions, find rules that will predict the

More information

A Note on the Recursive Calculation of Incomplete Gamma Functions

A Note on the Recursive Calculation of Incomplete Gamma Functions A Note on the Recursive Calculation of Incomplete Gamma Functions WALTER GAUTSCHI Purdue University It is known that the recurrence relation for incomplete gamma functions a n, x, 0 a 1, n 0,1,2,..., when

More information

Levelwise Search and Borders of Theories in Knowledge Discovery

Levelwise Search and Borders of Theories in Knowledge Discovery Data Mining and Knowledge Discovery 1, 241 258 (1997) c 1997 Kluwer Academic Publishers. Manufactured in The Netherlands. Levelwise Search and Borders of Theories in Knowledge Discovery HEIKKI MANNILA

More information

Algorithms for Characterization and Trend Detection in Spatial Databases

Algorithms for Characterization and Trend Detection in Spatial Databases Published in Proceedings of 4th International Conference on Knowledge Discovery and Data Mining (KDD-98) Algorithms for Characterization and Trend Detection in Spatial Databases Martin Ester, Alexander

More information

A New Concise and Lossless Representation of Frequent Itemsets Using Generators and A Positive Border

A New Concise and Lossless Representation of Frequent Itemsets Using Generators and A Positive Border A New Concise and Lossless Representation of Frequent Itemsets Using Generators and A Positive Border Guimei Liu a,b Jinyan Li a Limsoon Wong b a Institute for Infocomm Research, Singapore b School of

More information

Association Analysis. Part 1

Association Analysis. Part 1 Association Analysis Part 1 1 Market-basket analysis DATA: A large set of items: e.g., products sold in a supermarket A large set of baskets: e.g., each basket represents what a customer bought in one

More information

Processing Count Queries over Event Streams at Multiple Time Granularities

Processing Count Queries over Event Streams at Multiple Time Granularities Processing Count Queries over Event Streams at Multiple Time Granularities Aykut Ünal, Yücel Saygın, Özgür Ulusoy Department of Computer Engineering, Bilkent University, Ankara, Turkey. Faculty of Engineering

More information

Regression and Correlation Analysis of Different Interestingness Measures for Mining Association Rules

Regression and Correlation Analysis of Different Interestingness Measures for Mining Association Rules International Journal of Innovative Research in Computer Scien & Technology (IJIRCST) Regression and Correlation Analysis of Different Interestingness Measures for Mining Association Rules Mir Md Jahangir

More information

Selecting the Right Interestingness Measure for Association Patterns

Selecting the Right Interestingness Measure for Association Patterns Selecting the Right ingness Measure for Association Patterns Pang-Ning Tan Department of Computer Science and Engineering University of Minnesota 2 Union Street SE Minneapolis, MN 55455 ptan@csumnedu Vipin

More information

Relation between Pareto-Optimal Fuzzy Rules and Pareto-Optimal Fuzzy Rule Sets

Relation between Pareto-Optimal Fuzzy Rules and Pareto-Optimal Fuzzy Rule Sets Relation between Pareto-Optimal Fuzzy Rules and Pareto-Optimal Fuzzy Rule Sets Hisao Ishibuchi, Isao Kuwajima, and Yusuke Nojima Department of Computer Science and Intelligent Systems, Osaka Prefecture

More information

Inferring minimal rule covers from relations

Inferring minimal rule covers from relations Inferring minimal rule covers from relations CLAUDIO CARPINETO and GIOVANNI ROMANO Fondazione Ugo Bordoni, Via B. Castiglione 59, 00142 Rome, Italy Tel: +39-6-54803426 Fax: +39-6-54804405 E-mail: carpinet@fub.it

More information

Standardizing Interestingness Measures for Association Rules

Standardizing Interestingness Measures for Association Rules Standardizing Interestingness Measures for Association Rules arxiv:138.374v1 [stat.ap] 16 Aug 13 Mateen Shaikh, Paul D. McNicholas, M. Luiza Antonie and T. Brendan Murphy Department of Mathematics & Statistics,

More information

Chapter 4: Frequent Itemsets and Association Rules

Chapter 4: Frequent Itemsets and Association Rules Chapter 4: Frequent Itemsets and Association Rules Jilles Vreeken Revision 1, November 9 th Notation clarified, Chi-square: clarified Revision 2, November 10 th details added of derivability example Revision

More information

Application of Apriori Algorithm in Open Experiment

Application of Apriori Algorithm in Open Experiment 2011 International Conference on Computer Science and Information Technology (ICCSIT 2011) IPCSIT vol. 51 (2012) (2012) IACSIT Press, Singapore DOI: 10.7763/IPCSIT.2012.V51.130 Application of Apriori Algorithm

More information

Approximating a Collection of Frequent Sets

Approximating a Collection of Frequent Sets Approximating a Collection of Frequent Sets Foto Afrati National Technical University of Athens Greece afrati@softlab.ece.ntua.gr Aristides Gionis HIIT Basic Research Unit Dept. of Computer Science University

More information

Mining Strong Positive and Negative Sequential Patterns

Mining Strong Positive and Negative Sequential Patterns Mining Strong Positive and Negative Sequential Patter NANCY P. LIN, HUNG-JEN CHEN, WEI-HUA HAO, HAO-EN CHUEH, CHUNG-I CHANG Department of Computer Science and Information Engineering Tamang University,

More information

12 Count-Min Sketch and Apriori Algorithm (and Bloom Filters)

12 Count-Min Sketch and Apriori Algorithm (and Bloom Filters) 12 Count-Min Sketch and Apriori Algorithm (and Bloom Filters) Many streaming algorithms use random hashing functions to compress data. They basically randomly map some data items on top of each other.

More information

Chapter 1 (Basic Probability)

Chapter 1 (Basic Probability) Chapter 1 (Basic Probability) What is probability? Consider the following experiments: 1. Count the number of arrival requests to a web server in a day. 2. Determine the execution time of a program. 3.

More information

Similarity of Attributes by External Probes. P.O. Box 26, FIN Helsinki, Finland. with larger domains.)

Similarity of Attributes by External Probes. P.O. Box 26, FIN Helsinki, Finland. with larger domains.) Similarity of Attributes by External Probes Gautam Das University of Memphis Department of Mathematical Sciences Memphis TN 8, USA dasg@msci.memphis.edu Heikki Mannila and Pirjo Ronkainen University of

More information

Mining Infrequent Patter ns

Mining Infrequent Patter ns Mining Infrequent Patter ns JOHAN BJARNLE (JOHBJ551) PETER ZHU (PETZH912) LINKÖPING UNIVERSITY, 2009 TNM033 DATA MINING Contents 1 Introduction... 2 2 Techniques... 3 2.1 Negative Patterns... 3 2.2 Negative

More information