On Minimal Infrequent Itemset Mining

David J. Haglin and Anna M. Manning

Abstract: A new algorithm for minimal infrequent itemset mining is presented. Potential applications of finding infrequent itemsets include statistical disclosure risk assessment, bioinformatics, and fraud detection. This is the first algorithm designed specifically for finding these rare itemsets. Many itemset properties used implicitly in the algorithm are proved. The problem is shown to be NP-complete. Experimental results are then presented.

I. INTRODUCTION

Because of its importance to finding association rules, much attention has been given to the problem of finding itemsets that appear frequently in a dataset (cf. [1]-[4]). There have been hundreds of papers, as well as workshops at conferences (e.g., FIMI'03 and FIMI'04 at IEEE ICDM'03 and IEEE ICDM'04), devoted to this subject. The definition of frequent usually involves an integer threshold parameter, τ, that separates the itemset patterns considered frequent (those appearing at least τ times in the dataset) from the patterns considered infrequent. While it is possible to consider very small values of τ, the focus is most often placed on finding very frequent patterns.

Relatively less attention has been paid to infrequent itemsets. Yet they have many potential applications, including:

1) statistical disclosure risk assessment, where rare patterns in anonymized census data can lead to statistical disclosure;
2) bioinformatics, where rare patterns in microarray data may suggest genetic disorders; and
3) fraud detection, where rare patterns in financial or tax data may suggest unusual activity associated with fraudulent behavior.

In this paper we present a new algorithm for finding minimal infrequent patterns. This is the first algorithm designed specifically for finding minimal infrequent itemsets. It is based upon the SUDA2 algorithm developed for finding minimal unique itemsets (unique itemsets with no unique proper subsets) [8], [9].
We then show that the minimal infrequent itemset problem is NP-complete. Finally, experimental results are presented.

II. PROBLEM SPECIFICATION

Let $\mathcal{I} = \{i_1, i_2, \ldots, i_L\}$ be a set of items. An itemset is a subset $I \subseteq \mathcal{I}$. The cardinality of $I$, denoted by $|I|$, is the number of items in the itemset. As a shorthand, we will write c-itemset to mean an itemset of cardinality c.

David J. Haglin is with the Department of Computer and Information Sciences, Minnesota State University, Mankato, MN 56001, USA (david.haglin@mnsu.edu). Anna M. Manning is with the School of Computer Science, University of Manchester, Oxford Rd., Manchester, M13 9PL, UK (anna@manchester.ac.uk).

A dataset, $D = \{t_1, t_2, \ldots, t_R\}$, is a collection of R transactions (sometimes called records) of the form $t_i = (i, T_i)$, where $i$ is the transaction identifier (TID) and $T_i \subseteq \mathcal{I}$. We denote by $|D|$ the number of transactions in the dataset. Given an itemset $I$, a transaction $T$ is said to contain $I$ if $I \subseteq T$. The support set of an itemset $I$ with respect to the dataset $D$ is $D(I) = \{t_i \in D : I \subseteq T_i\}$. The support of an itemset $I$ in dataset $D$ is the cardinality of the support set of $I$; that is, $\mathrm{Supp}_D(I) = |D(I)|$. The relative support of an itemset, defined as $\mathrm{Supp}_D(I)/|D|$, is a number between 0 and 1 inclusive.

Given a dataset $D$ and an integer threshold τ, we say an itemset $I$ is:

  τ-occurrent if $|D(I)| = \tau$;
  τ-frequent if $|D(I)| \ge \tau$;
  τ-infrequent if $|D(I)| < \tau$.

To describe an itemset as unique, we can say either that it is 1-occurrent or that it is 2-infrequent. In addition, we say an itemset is:

  minimal τ-occurrent if it is τ-occurrent and all of its proper subsets are (τ + 1)-frequent;
  minimal τ-infrequent if it is τ-infrequent and all of its proper subsets are τ-frequent;
  maximal τ-occurrent if it is τ-occurrent and all of its proper supersets are τ-infrequent; and
  maximal τ-frequent if it is τ-frequent and all of its proper supersets are τ-infrequent.
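The definitions above translate almost directly into code. The following is a minimal Python sketch, not the authors' implementation; the function names and the four-transaction toy dataset are illustrative assumptions. Transactions are modelled as frozensets of integer item labels.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[int]


def support(dataset: List[Itemset], itemset: Itemset) -> int:
    """Supp_D(I): the number of transactions that contain every item of I."""
    return sum(1 for t in dataset if itemset <= t)


def is_occurrent(dataset: List[Itemset], itemset: Itemset, tau: int) -> bool:
    return support(dataset, itemset) == tau


def is_frequent(dataset: List[Itemset], itemset: Itemset, tau: int) -> bool:
    return support(dataset, itemset) >= tau


def is_infrequent(dataset: List[Itemset], itemset: Itemset, tau: int) -> bool:
    return support(dataset, itemset) < tau


def is_minimal_infrequent(dataset: List[Itemset], itemset: Itemset, tau: int) -> bool:
    """tau-infrequent, and every proper subset is tau-frequent.

    Support never increases when items are added, so it is enough to test
    the subsets obtained by dropping exactly one item.
    """
    if not is_infrequent(dataset, itemset, tau):
        return False
    return all(is_frequent(dataset, itemset - {i}, tau) for i in itemset)


# Hypothetical toy dataset (not taken from the paper).
TOY = [frozenset(t) for t in ({1, 2}, {1, 2}, {1, 3}, {2, 3})]
pair = frozenset({1, 3})
print(support(TOY, pair))                    # 1
print(is_minimal_infrequent(TOY, pair, 2))   # True
```

In this toy dataset {1, 3} is unique: it is both 1-occurrent and 2-infrequent, matching the remark above that the two descriptions coincide.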
Since there are datasets known to produce exponentially many τ-frequent itemsets [7], many strategies to compress the output have been considered. One obvious strategy is to find only maximal τ-frequent itemsets. Similarly, for τ-infrequent itemsets, it may be enough to find only minimal τ-infrequent itemsets.

We assume that the input is given in binary matrix form where the number of rows is R, the number of columns is L, and an entry at (x, y) is a 1 if and only if the transaction whose TID = x contains item y.

III. ALGORITHM MINIT

We recently introduced an algorithm called SUDA2 [8], [9], which finds minimal unique itemsets (MUIs) in a dataset with different properties than those defined above. With certain parameter settings both SUDA2 and our new algorithm should find MUIs. However, the input datasets differ enough to render comparing running times between these two algorithms meaningless.

A. Dataset differences between MINIT and SUDA2

The easiest way to describe the differences in dataset properties is to consider the matrix form. For traditional itemset mining, the matrix consists of binary entries. But for SUDA2, the matrix entries can contain any integer. We can transform a SUDA2-type matrix into a binary matrix by enumerating all of the <column, value> pairs. For each of these pairs, a column is created in the transformed binary matrix. For every value in a column in the SUDA2-type input matrix, the corresponding <column, value> location in the transformed binary matrix is given a one. For example, if the first column of the SUDA2-type matrix contains integers in the range of 0 to 2, then the transformed matrix will have three columns in place of that first column, with a 1 in the specific column indicating the integer value in the SUDA2-type matrix. The essential difference between the SUDA2-type matrix and the traditional binary matrix datasets is the added constraint that among collections of columns (such as the three columns corresponding to the first SUDA2-type column of 0 to 2 values) there is exactly one 1 value and the rest are 0 values in every row.

B. The MINIT algorithm

Our new algorithm can be adapted to handle the more traditional dataset definition and to handle finding minimal τ-infrequent itemsets (MIIs). We call this adaptation MINIT, for MINimal Infrequent iTemsets. Initially, a ranking of items is prepared by computing the support of each of the items and then creating a list of items in ascending order of support. Minimal τ-infrequent itemsets are discovered by considering each item $i_j$ in rank order, recursively calling MINIT on the support set of the dataset with respect to $i_j$ considering only those items with higher rank than $i_j$, and then checking each candidate MII against the original dataset.
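The transformation from a SUDA2-type integer matrix to a binary matrix described in Section III-A can be sketched as follows. This is an illustrative sketch only: the function name `to_binary_matrix` and the sample matrix are assumptions, and the sketch creates columns only for <column, value> pairs that actually occur in the input.

```python
from typing import List, Tuple


def to_binary_matrix(rows: List[List[int]]) -> Tuple[List[List[int]], List[Tuple[int, int]]]:
    """One binary column per observed <column, value> pair.

    Returns the binary matrix and the list of pairs labelling its columns.
    """
    n_cols = len(rows[0])
    # Enumerate every <column, value> pair present in the input, in a fixed order.
    pairs = sorted({(col, row[col]) for row in rows for col in range(n_cols)})
    index = {pair: k for k, pair in enumerate(pairs)}
    binary = [[0] * len(pairs) for _ in rows]
    for r, row in enumerate(rows):
        for col, value in enumerate(row):
            binary[r][index[(col, value)]] = 1
    return binary, pairs


# Hypothetical SUDA2-type matrix: the first column takes values 0..2.
suda2_rows = [[0, 7], [2, 7], [1, 5]]
binary, pairs = to_binary_matrix(suda2_rows)
print(pairs)   # [(0, 0), (0, 1), (0, 2), (1, 5), (1, 7)]
print(binary)  # [[1, 0, 0, 0, 1], [0, 0, 1, 0, 1], [0, 1, 0, 1, 0]]
```

Every row of the result contains exactly one 1 within each group of columns derived from the same original column, which is the added constraint that distinguishes SUDA2-type data from general binary data.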
One mechanism that can be used to consider only higher-ranking items in the recursion is to maintain a liveness vector indicating which items remain viable at each level of the recursion. The initial call to the recursive algorithm presented in Algorithm 1 is MINIT(D, V[1:L], maxc). The liveness vector, V[1:L], is initialized to all true values and must be passed by value to lower levels of the recursion, as a unique copy of this vector is required at every node in the recursion tree. For those inputs that require prohibitively large running times, supplying a limit for maxc may yield enough useful information within a reasonable amount of time. For easier datasets, setting maxc to L will find all MIIs.

A significant computational effort of MINIT is to search through the dataset for transactions that hold a specific item. To help this search occur quickly, we pre-process the dataset by building linked lists of TIDs for each item. Essentially, we pre-compute $D(\{i_j\})$ for $1 \le j \le L$ by creating linked lists of pointers to the transactions. We also arrange the items in ascending order by support, which is required in order for MINIT to work correctly.
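The pre-processing step just described can be sketched in Python as below. This is a hedged illustration rather than the authors' code: plain Python lists stand in for the linked lists of TIDs, the names `build_tid_lists` and `rank_items` are hypothetical, ties in support are broken by item label (an assumption), and items present in every transaction are dropped as non-viable. The data is the example dataset of Table I.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_tid_lists(dataset: Dict[int, Set[int]]) -> Dict[int, List[int]]:
    """Map each item to the sorted list of TIDs whose transaction holds it."""
    tid_lists: Dict[int, List[int]] = defaultdict(list)
    for tid in sorted(dataset):
        for item in dataset[tid]:
            tid_lists[item].append(tid)
    return dict(tid_lists)


def rank_items(tid_lists: Dict[int, List[int]], n_transactions: int) -> List[int]:
    """Viable items in ascending order of support (ties broken by item label)."""
    viable = [i for i, tids in tid_lists.items() if len(tids) < n_transactions]
    return sorted(viable, key=lambda i: (len(tid_lists[i]), i))


# Example dataset of Table I, keyed by TID.
TABLE_I = {
    1: {1, 2, 4, 5, 6},
    2: {2, 3, 4, 5, 6},
    3: {1, 2, 3, 4, 6},
    4: {1, 2, 3, 5, 6},
    5: {1, 3, 4, 5, 6},
    6: {1, 3, 6},
    7: {1, 2, 5},
}

tid_lists = build_tid_lists(TABLE_I)
print(tid_lists[4])                         # [1, 2, 3, 5]
print(rank_items(tid_lists, len(TABLE_I)))  # [4, 2, 3, 5, 1, 6]
```

The length of each TID list is the support of the corresponding item, so the ranking falls out of the same pass that builds the lists.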
Note that some of the items may be discarded (considered non-viable) in this pre-processing phase for reasons such as having a support equal to $|D|$ (an item appearing in every transaction cannot be part of any MII).

Algorithm 1 MINIT(D, V[1:L], maxc)
   1: Input: D = input dataset with N rows and L columns of binary numbers
   2: Input: V[1:L] = a boolean vector indicating viability of each item
   3: Input: maxc = upper bound on cardinality of MII to find in the search
   4: Returns: A listing of all MIIs for dataset D with cardinality at most maxc
   5: compute R ← list of all items in D in ascending order of support
   6: if maxc == 1 /* stopping condition */ then
   7:     Return all items of R that appear less than τ times in D
   8: else
   9:     M ← ∅
  10:     for each item i_j ∈ R do
  11:         D_j ← D({i_j})
  12:         V[i_j] ← false
  13:         C_j ← recursive call to MINIT(D_j, V, maxc − 1)
  14:         for each candidate itemset I ∈ C_j do
  15:             if I ∪ {i_j} is an MII in D then
  16:                 M ← M ∪ (I ∪ {i_j})
  17:             end if
  18:         end for
  19:     end for
  20:     Return M
  21: end if

Whenever MINIT descends one level of the recursion, a new sub-dataset is built to represent the support set $D(\{i_j\})$. In the interest of memory efficiency, MINIT maintains only one copy of the dataset. A sub-dataset is represented by a linked list of the TIDs of its transactions. To support the recursion of MINIT, the new sub-dataset must be pre-processed. There are three tasks required to perform the pre-processing:

1) computing the support of each item, which is needed to produce a rank-ordering of the viable items by support within $D(\{i_j\})$;
2) determining the viability of each item (i.e., pruning some of the items from consideration); and
3) computing the support set of each viable item, resulting in a memory-efficient representation of the lists of TIDs for each support set $D(\{i_j, i_k\})$ for $1 \le k \le L$, $j \ne k$.

Observe that the support set of $\{i_k\}$ within $D(\{i_j\})$ is the same as $D(\{i_j, i_k\})$.
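The recursion of Algorithm 1 can be illustrated with a compact Python sketch. It is not the authors' implementation and departs from the pseudocode in ways that should be read as assumptions: sub-datasets are copied rather than represented by TID lists, the liveness vector is a set of still-viable items copied at each level, infrequent single items are reported at every level rather than only at the stopping condition, candidates are verified directly against the definition of an MII, and the dataset is assumed to contain at least τ transactions.

```python
from typing import FrozenSet, List, Set

Itemset = FrozenSet[int]


def support(dataset: List[Itemset], itemset) -> int:
    return sum(1 for t in dataset if itemset <= t)


def is_minimal_infrequent(dataset: List[Itemset], itemset: Itemset, tau: int) -> bool:
    # Infrequent itself, and frequent after dropping any single item.
    if support(dataset, itemset) >= tau:
        return False
    return all(support(dataset, itemset - {i}) >= tau for i in itemset)


def minit(dataset: List[Itemset], live: Set[int], tau: int, maxc: int) -> List[Itemset]:
    """MIIs of `dataset` built only from `live` items, with cardinality <= maxc."""
    # Rank the viable items in ascending order of support (ties by label).
    ranked = sorted(live, key=lambda i: (support(dataset, {i}), i))
    remaining = set(live)          # private copy: the liveness vector by value
    found: List[Itemset] = []
    for item in ranked:
        remaining.discard(item)    # only higher-ranked items stay viable below
        sub = [t for t in dataset if item in t]      # D({item})
        if len(sub) < tau:
            found.append(frozenset({item}))          # infrequent single item
            continue
        if maxc > 1:
            for cand in minit(sub, remaining, tau, maxc - 1):
                full = cand | {item}
                if is_minimal_infrequent(dataset, full, tau):
                    found.append(full)
    return found


# Example dataset of Table I (TIDs 1..7 in order).
TABLE_I = [frozenset(t) for t in (
    {1, 2, 4, 5, 6}, {2, 3, 4, 5, 6}, {1, 2, 3, 4, 6}, {1, 2, 3, 5, 6},
    {1, 3, 4, 5, 6}, {1, 3, 6}, {1, 2, 5},
)]

mii = minit(TABLE_I, set(range(1, 7)), tau=3, maxc=6)
print(frozenset({2, 4, 5}) in mii)   # True
```

Each MII is produced exactly once, at the loop iteration of its lowest-ranked item, because every candidate returned from the recursion contains only items that are still viable after that item has been removed.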
In practice, MINIT can perform the first and third tasks concurrently to avoid two passes over the data. $D(\{i_j, i_k\})$ can be computed, and the size of that list is then used as the support of $i_k$ in $D(\{i_j\})$. The second task can then be performed, and the data structures built for the support sets of the pruned items are discarded. When most of the items remain viable, the technique of computing all support sets and deriving support from them is very effective. However, if many of the items are discarded as non-viable, building the support set lists for these discarded items is wasted effort.

The statement at line 7 of Algorithm 1 can be modified to return all items of R that appear exactly τ times in D, which transforms MINIT into an algorithm that finds all τ-occurrent rather than all τ-infrequent itemsets.

IV. MINIMAL ITEMSET PROPERTIES

We present properties of MIIs that MINIT relies upon in order to correctly find all MIIs. These properties are adapted from those presented in [9] for the SUDA2 algorithm and dataset characteristics.

Consider a minimal τ-infrequent c-itemset $I$. By definition, $I$ must have the following property.

Property 1 (Rareness Property): If itemset $I$ is an MII, then $\mathrm{Supp}_D(I) < \tau$.

We note that if an itemset is minimal τ-infrequent, then it must be minimal δ-occurrent for some $\delta < \tau$. However, a minimal δ-occurrent c-itemset, $I$, is not necessarily minimal τ-infrequent for all $\tau > \delta$; there may be some subset of $I$ of size $c - 1$ with support $\epsilon$, for $\delta < \epsilon < \tau$.

The following theorem shows that certain itemsets must exist within the dataset, $D$, in order for an MII to exist. We call the transactions holding those certain itemsets support rows.

Theorem 2 (Support Row Property): Given a minimal τ-infrequent c-itemset $I = \{i_1, i_2, \ldots, i_c\}$, with $\mathrm{Supp}_D(I) = \delta$, $\delta < \tau$, for each $1 \le j \le c$ there must exist at least $\tau - \delta$ support rows in $D$ containing itemset $I \setminus \{i_j\}$ (but not item $i_j$).

Proof: Suppose for some $j$ there exist fewer than $\tau - \delta$ rows containing $I' = I \setminus \{i_j\}$ and not $i_j$. The only other rows containing $I'$ are the $\delta$ rows of $D(I)$, so $\mathrm{Supp}_D(I') < \tau$, and therefore $I$ is not minimal τ-infrequent.
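As a concrete check of the Support Row Property, the sketch below counts, for each item of an itemset, the rows that contain the rest of the itemset but not that item. The helper name `support_rows` is a hypothetical choice; the data is the example dataset of Table I, and the itemset {2, 4, 5} is treated as minimal 3-infrequent (it has support 2).

```python
from typing import Dict, FrozenSet, List

Itemset = FrozenSet[int]


def support_rows(dataset: Dict[int, Itemset], itemset: Itemset, item: int) -> List[int]:
    """TIDs of rows containing itemset minus {item} but not item itself."""
    rest = itemset - {item}
    return [tid for tid, t in sorted(dataset.items()) if rest <= t and item not in t]


# Example dataset of Table I, keyed by TID.
TABLE_I = {
    1: frozenset({1, 2, 4, 5, 6}),
    2: frozenset({2, 3, 4, 5, 6}),
    3: frozenset({1, 2, 3, 4, 6}),
    4: frozenset({1, 2, 3, 5, 6}),
    5: frozenset({1, 3, 4, 5, 6}),
    6: frozenset({1, 3, 6}),
    7: frozenset({1, 2, 5}),
}

I = frozenset({2, 4, 5})
tau = 3
delta = sum(1 for t in TABLE_I.values() if I <= t)   # Supp_D(I)

for item in sorted(I):
    rows = support_rows(TABLE_I, I, item)
    # Theorem 2 requires at least tau - delta such rows for every item of I.
    print(item, rows, len(rows) >= tau - delta)
```

Here $\delta = 2$, so a single support row per item suffices; the rows found are TID 5 for item 2, TIDs 4 and 7 for item 4, and TID 3 for item 5.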
Observe that a support row works for only one item in $I$; thus there are at least $c(\tau - \delta)$ support rows. The existence of $c(\tau - \delta)$ support rows in $D$ is both a necessary and sufficient condition for $I$ to be minimal, as seen by the previous theorem and the following lemma.

Lemma 3: Given an itemset $I = \{i_1, \ldots, i_c\}$ that is τ-infrequent in $D$, with $\mathrm{Supp}_D(I) = \delta$ for $\delta < \tau$, if $D$ has $c(\tau - \delta)$ support rows containing the $c$ subsets of $I$ of size $c - 1$, then $I$ is a minimal τ-infrequent itemset.

Proof: Suppose $I$ is not minimal. Then there exists $J \subset I$ that is also τ-infrequent within $D$. Clearly, from Theorem 2, $|J| < c - 1$. However, it must be true that $J \subset I'$, where $I'$ is one of the $c$ subsets of $I$ of size $c - 1$. Since $J \subset I$, $J$ appears in the $\delta$ rows holding $I$. Moreover, since $J \subset I'$, Theorem 2 states that $J$ must appear in the $\tau - \delta$ support rows for $I'$. Thus, $J$ appears in at least τ rows in $D$, so it is not τ-infrequent.

This leads to the following observation as to the minimum support required of a single item in order to appear in a τ-infrequent c-itemset.

Theorem 4 (Minimum Support Property): Given a fixed τ and itemset cardinality $c$, an item $i$ must have support $\mathrm{Supp}_D(\{i\}) \ge c + \tau - 2$ in order for $i$ to be part of a minimal τ-infrequent c-itemset $I$.

Proof: Let $I$ be a minimal τ-infrequent itemset with $|I| = c$ and $\mathrm{Supp}_D(I) = \delta$. If $i \in I$, then $i$ must appear in at least the $\delta$ reference rows of $I$ and the $(c - 1)(\tau - \delta)$ support rows for the $c - 1$ subsets of $I$ that contain $i$ and have cardinality $c - 1$. Thus,

$$\mathrm{Supp}_D(\{i\}) \ge \delta + (c - 1)(\tau - \delta) = c(\tau - \delta) + (2\delta - \tau) = c\tau - \tau + 2\delta - c\delta.$$

Let $\delta = \tau - r$, for some integer $r > 0$. Then,

$$\mathrm{Supp}_D(\{i\}) \ge c\tau - \tau + 2(\tau - r) - c(\tau - r) = \tau + r(c - 2).$$

For a fixed $c > 2$ and τ, this is minimum when $r = 1$ (i.e., $\delta = \tau - 1$), which gives the bound $\tau + c - 2$.

The Minimum Support Property can give an efficient way to prune significant areas of the search space if several items have low support counts.
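The pruning suggested by the Minimum Support Property can be sketched as a simple filter on item supports. The function names below are illustrative assumptions, the bound is applied only for cardinality $c \ge 2$, and the supports are those of the items in the example dataset of Table I.

```python
from typing import Dict, Set


def support_lower_bound(c: int, tau: int, delta: int) -> int:
    """delta reference rows plus (c - 1)(tau - delta) support rows."""
    return delta + (c - 1) * (tau - delta)


def min_item_support(c: int, tau: int) -> int:
    """Smallest support an item can have inside a minimal tau-infrequent c-itemset."""
    if c < 2:
        raise ValueError("the bound is applied here only for c >= 2")
    return tau + c - 2


def viable_items(item_supports: Dict[int, int], c: int, tau: int) -> Set[int]:
    """Items whose support is high enough to appear in some MII of cardinality c."""
    bound = min_item_support(c, tau)
    return {item for item, s in item_supports.items() if s >= bound}


# Item supports in the example dataset of Table I.
supports = {1: 6, 2: 5, 3: 5, 4: 4, 5: 5, 6: 6}
tau = 3
print(viable_items(supports, 3, tau))  # {1, 2, 3, 4, 5, 6}
print(viable_items(supports, 4, tau))  # {1, 2, 3, 5, 6}
print(viable_items(supports, 5, tau))  # {1, 6}
```

With τ = 3, item 4 (support 4) can still belong to a 3-itemset such as {2, 4, 5}, but it can be excluded from any search for MIIs of cardinality 4 or more.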
Corollary 5 (Uniform Support Property): Given a dataset $D$ and an item $i$ contained in every transaction of $D$, $i$ cannot be contained in any minimal τ-infrequent itemset $I$.

Proof: If $I$ is a minimal τ-infrequent itemset in $D$ containing item $i$, then the Support Row Property ensures the existence of a row in $D$ containing $I \setminus \{i\}$ and not $i$. Since $i$ appears in every transaction in $D$, we have a contradiction.

V. RECURSIVE ITEMSET PROPERTIES

Given dataset $D$ and some item $a$ (called an anchor item), we consider the itemset properties of $D$ and $D(\{a\})$ based upon the recursive Algorithm 1.

Lemma 6: Given that $I = \{i_1, \ldots, i_c\}$ is a minimal τ-infrequent itemset in $D$, for each anchor $a = i_j$, $1 \le j \le c$, the itemset $I_a = I \setminus \{i_j\}$ is a minimal τ-infrequent itemset in $D(\{a\})$. Moreover, $|I_a| = |I| - 1$.

Proof: Without loss of generality, fix $j$ in the range of 1 to $c$ inclusive. Let $a = i_j$. Since every row in $D(\{a\})$ contains item $a$, the only way $I_a$ could appear at least τ times in $D(\{a\})$ is if $I$ appeared in at least τ rows of $D$. Therefore, $I_a$ is τ-infrequent in $D(\{a\})$. Similarly, if $I_a$ is not minimal τ-infrequent in $D(\{a\})$, then there exists $I'_a \subset I_a$ that is also τ-infrequent in $D(\{a\})$.

But $I'_a \cup \{a\}$ would also be τ-infrequent in $D$ and is a proper subset of $I$, contradicting the minimality of $I$. Hence, $I_a$ must be minimal τ-infrequent in $D(\{a\})$.

Unfortunately, it is not the case that all minimal τ-infrequent itemsets in $D(\{a\})$ lead to minimal τ-infrequent itemsets in the original dataset $D$. However, the following theorem provides a method for finding those that do.

Theorem 7: Given a dataset $D$, an item $a$, and $I_a$ as a minimal τ-infrequent itemset in $D(\{a\})$ with $\mathrm{Supp}_{D(\{a\})}(I_a) = \delta$, the itemset $I = I_a \cup \{a\}$ is a minimal τ-infrequent itemset in $D$ if and only if there exist at least $\tau - \delta$ rows in $D$ containing $I_a$ but not containing item $a$.

Proof: If $I$ is a minimal τ-infrequent itemset in $D$ with $\mathrm{Supp}_D(I) = \delta$, then the Support Row Property ensures the existence of $\tau - \delta$ rows in $D$ containing $I_a$ but not containing item $a$. For the other direction, assume $\tau - \delta$ rows exist in $D$ containing $I_a$ but not $a$ and that $I_a$ is a minimal τ-infrequent itemset in $D(\{a\})$. Note that $\mathrm{Supp}_D(I) = \delta$. All that is required to show is that $I$ has the requisite $c(\tau - \delta)$ support rows. Each of the $(c - 1)(\tau - \delta)$ support rows in $D(\{a\})$, augmented with item $a$, forms a support row in $D$ for the itemset $I$. As all of these $(c - 1)(\tau - \delta)$ support rows contain item $a$, the only other support rows needed to ensure $I$ is a minimal τ-infrequent itemset in $D$ are the $\tau - \delta$ rows stated in the theorem. Since $I$ is τ-infrequent and since $c(\tau - \delta)$ support rows exist in $D$, $I$ must be a minimal τ-infrequent itemset in $D$.

The Recursive Property of MIIs helps define the boundaries of the search space by providing a clear indication of the maximum cardinality of candidate MIIs.

VI. EXAMPLE

To help understand the recursive algorithm and pruning techniques, we present the following example. Consider the input dataset as shown in Table I(a). We will follow the discovery of the 2-occurrent itemset $I = \{2, 4, 5\}$, whose items have ranks 1, 2, and 4.
TABLE I: EXAMPLE DATASET

(a) Dataset
  TID   Items
  1     1, 2, 4, 5, 6
  2     2, 3, 4, 5, 6
  3     1, 2, 3, 4, 6
  4     1, 2, 3, 5, 6
  5     1, 3, 4, 5, 6
  6     1, 3, 6
  7     1, 2, 5

(b) Rank Items
  Rank  Item  Support
  1     4     4
  2     2     5
  3     3     5
  4     5     5
  5     1     6
  6     6     6

Algorithm 1 finds itemset $I$ by first computing the rank order of items as shown in Table I(b). Lemma 6 indicates that any itemset $I$ that is τ-occurrent in $D$ and whose lowest-ranked item has rank 1 (i.e., item 4) will have $I \setminus \{4\}$ as a τ-occurrent itemset in $D(\{4\})$. So we compute $D(\{4\})$ as shown in Table II(a). Note that we can ignore item 6 in $D(\{4\})$ by Corollary 5. This brings us down one level in the recursion tree as we explore $D(\{4\})$.

TABLE II: DATASET D({4})

(a) Dataset
  TID   Items
  1     1, 2, 4, 5, 6
  2     2, 3, 4, 5, 6
  3     1, 2, 3, 4, 6
  5     1, 3, 4, 5, 6

(b) Rank Items
  Rank  Item  Support
  1     1     3
  2     2     3
  3     3     3
  4     5     3

As we enter Algorithm 1 recursively, we first construct a new rank item list for the dataset at this recursion level (see Table II(b)). For the second iteration of the loop at line 10 of Algorithm 1, the anchor item will be 2. Descending the recursion tree for this anchor will produce the dataset in Table III(a).

TABLE III: DATASET D({2,4})

(a) Dataset
  TID   Items
  1     1, 2, 4, 5, 6
  2     2, 3, 4, 5, 6
  3     1, 2, 3, 4, 6

(b) Viable items
  TID   Items
  1     1, 5
  2     3, 5
  3     1, 3

Note that the viable items 1, 3, and 5 all have support at or below our threshold τ = 2. So this recursion tree node returns the itemset list {{1}, {3}, {5}} to the next higher recursion node. To determine that {2, 5} is a 2-occurrent itemset in $D(\{4\})$, we need only find sufficient support rows as described in Theorem 7. Observe that TID 5 in Table II(a) contains item 5 but does not contain item 2. This one support row, along with a support of 2 for item 5 in $D(\{2, 4\})$, is enough to conclude that {2, 5} is indeed a 2-occurrent itemset in $D(\{4\})$. It will therefore be included in the collection of itemsets passed up to the parent node of the recursion tree. At the root node of the recursion tree, the candidate itemset {2, 5} is merged with item 4.
This candidate {2, 4, 5} is then checked for qualification as a 2-occurrent itemset in $D$, using Theorem 7. Since {2, 5} has support 2 in $D(\{4\})$, one support row in $D$ is sufficient. That support row is TID 4. Since this is the top level of the recursion, we conclude that {2, 4, 5} is a 2-occurrent itemset in $D$.

We note that for Algorithm 1 to find a minimal τ-infrequent c-itemset, $I$, it must explore $c$ levels of the recursion tree. Each level of the tree corresponds to one of the items in $I$. The item associated with the bottom level of the tree has support in that bottom-level dataset equal to the support of $I$ in the original dataset $D$. In fact, the support remains the same at each level of the recursion tree.
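The final step of the example, the Theorem 7 check at the root of the recursion tree, can be reproduced with a short Python sketch. The function name `extends_to_mii` is hypothetical, and the check is phrased for minimal 3-infrequent itemsets, which is how the 2-occurrent itemset {2, 4, 5} appears when τ = 3. The function assumes its candidate is already a minimal τ-infrequent itemset in $D(\{a\})$.

```python
from typing import Dict, FrozenSet

Itemset = FrozenSet[int]


def extends_to_mii(dataset: Dict[int, Itemset], anchor: int,
                   candidate: Itemset, tau: int) -> bool:
    """Theorem 7 test for candidate | {anchor}.

    Assumes `candidate` is a minimal tau-infrequent itemset in D({anchor}).
    """
    # delta: support of the candidate inside the anchored sub-dataset D({anchor}).
    delta = sum(1 for t in dataset.values() if anchor in t and candidate <= t)
    # Rows of D that hold the candidate but lack the anchor item.
    outside = sum(1 for t in dataset.values() if anchor not in t and candidate <= t)
    return outside >= tau - delta


# Example dataset of Table I, keyed by TID.
TABLE_I = {
    1: frozenset({1, 2, 4, 5, 6}),
    2: frozenset({2, 3, 4, 5, 6}),
    3: frozenset({1, 2, 3, 4, 6}),
    4: frozenset({1, 2, 3, 5, 6}),
    5: frozenset({1, 3, 4, 5, 6}),
    6: frozenset({1, 3, 6}),
    7: frozenset({1, 2, 5}),
}

print(extends_to_mii(TABLE_I, anchor=4, candidate=frozenset({2, 5}), tau=3))  # True
```

In Table I the candidate {2, 5} has support 2 inside $D(\{4\})$, so one row outside $D(\{4\})$ holding {2, 5} is enough; TID 4 is such a row (TID 7 is another), so {2, 4, 5} is accepted.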

VII. COMPLEXITY OF MINIMAL τ-OCCURRENT ITEMSET MINING

A. Variations of the problem

For a problem such as finding minimal τ-occurrent itemsets, there are variations that have important implications for the complexity of the problem. We consider the following problem variations in increasing order of computational difficulty:

1) The simplest form of a minimal τ-occurrent problem is a decision problem where the objective is, for a given input dataset, to determine if there exists any minimal τ-occurrent itemset and merely answer yes or no.
2) The next harder problem is a search problem where the objective is, for a given input dataset, to find one (any) minimal τ-occurrent itemset and print out the solution. There are actually two sub-variations of the search version of the problem: (i) find any solution for a specific record in the input dataset; and (ii) find any solution in any record in the input dataset. We note that the computationally easier form of the two sub-variations is the less restrictive search problem.
3) The objective of the counting version is, for a given input dataset, to determine the number of minimal τ-occurrent itemsets and to print out the number. Even though there may be an exponential number of minimal τ-occurrent itemsets in a given dataset, it may be possible to count them in polynomial time and print out a polynomial-size representation of the exponential count (e.g., in binary representation).
4) The most computationally challenging variation is, for a given input dataset, to find and print out all of the minimal τ-occurrent itemsets.

Some attention has recently been given to finding rare patterns in datasets [6], [8], [9]. From this perspective, it makes sense to search for minimal τ-occurrent itemsets. A special case of this problem is to set τ = 1, meaning only minimal unique itemsets are sought. Yang provides a nice complexity analysis of the four variations of the maximal τ-occurrent problem.
The counting variation of the maximal τ-occurrent problem is #P-complete [7], whereas searching for a single solution is possible in polynomial time. We show that by seeking minimal rather than maximal τ-occurrent itemsets, even the simplest variation, the decision version, is NP-complete.

B. Computational complexity of minimal τ-occurrent itemsets

Our proof is based on a proof presented by Daishin Nakamura in [10], which addresses only the variation of searching in a specific record for a 1-occurrent itemset (i.e., a minimal unique itemset). The proof is by a reduction from the Hitting Set problem (see [11] for NP-completeness proof techniques). An instance of the Hitting Set problem, $H = (p, C, k)$, is defined as follows: given a collection $C = \{C_1, \ldots, C_q\}$ of subsets of a finite set $S = \{1, \ldots, p\}$ and a positive integer $k \le p$, determine whether there exists a subset $S' \subseteq S$ with $|S'| \le k$ such that $S'$ contains at least one element from each set of $C$.

Theorem 8: Given a dataset and a fixed constant $\tau \ge 1$, to determine if there exists any τ-occurrent itemset in the dataset is NP-complete.

Proof: Given an instance of the Hitting Set problem $H = (p, C, k)$, construct a $q \times p$ matrix:

$$M = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{q,1} & x_{q,2} & \cdots & x_{q,p} \end{pmatrix}$$

where $x_{i,j} = 1$ if $j \in C_i$ and $x_{i,j} = 0$ otherwise. Observe that every subset of the columns of $M$ corresponds to a subset $S' \subseteq S$ in the Hitting Set problem. Moreover, $S'$ is a hitting set in $H$ if and only if the subset of columns whose index numbers are in $S'$ induces a matrix projection with at least one 1-entry in every row. This can be seen as each row $i$ of $M$ corresponds to $C_i$ in $H$.

Now denote by $Z$ a $(\tau \times p)$-matrix of zeroes. Construct a dataset matrix $D$:

$$D = \begin{pmatrix} Z \\ M \\ \vdots \\ M \end{pmatrix}$$

where the number of copies of $M$ is $\tau + 1$. Now find a minimum τ-occurrent itemset $I$ in $D$. If $I$ contained any 1-entries, then a record holding $I$ must come from the $M$ portion of $D$. However, since each row of $M$ appears $\tau + 1$ times in $D$, such an itemset could not be τ-occurrent.
So, $I$ consists of only zeros and appears in the first τ rows of zeros in $D$. Let $S'$ be the set of columns associated with $I$. Since $I$ is τ-occurrent, each row in $M$ must contain a 1-entry in at least one of the columns in $S'$. This corresponds directly to a solution to $H$. Therefore, any algorithm for finding τ-occurrent itemsets in a dataset, for any $\tau \ge 1$, can be used to solve the Hitting Set problem.

VIII. EXPERIMENTAL RESULTS

All of the experiments were run on Dual Core AMD Opteron Processor 270s running at 2 GHz with 8 GB of RAM. The datasets we use come from [...]. All of the datasets are in the proper format for MINIT.

A. Mushroom Data

The mushroom dataset contains 8124 transactions and 119 items in its inventory. This particular dataset does not challenge MINIT, as it can run within 5 seconds for any

threshold $1 \le \tau \le 8124$ with no restriction on itemset cardinality. What we can see from this dataset is the number of minimal τ-infrequent and minimal τ-occurrent itemsets for varying values of the threshold τ (Figure 1). We expect the number of minimal τ-occurrent itemsets to be less than the number of minimal τ-infrequent itemsets. Figure 1 shows just how drastically different they are. Since the τ-infrequent itemsets include all of the τ-occurrent itemsets, the remainder of this section focuses only on computing minimal τ-infrequent itemsets.

[Fig. 1. Mushroom Dataset. Vertical axis: Number of MIIs; series: delta-infrequent, delta-occurrent.]

B. Chess Data

The chess dataset contains 3196 transactions and 75 items in its inventory. Although this dataset is much smaller than the mushroom dataset, it presents significantly more of a computational challenge to MINIT.

We imposed limits on the cardinality of τ-infrequent itemsets. For each maxc of 6, 7, and 8, we ran trials varying the threshold τ. The running time for maxc == 8 is shown in Figure 2. To see the growth pattern, the numbers of minimal τ-infrequent itemsets are shown for maxc of 6, 7, and 8 in Figure 3.

[Fig. 2. Chess Dataset Computation Time. Vertical axis: Seconds; series: maxc==8.]

[Fig. 3. Chess Dataset MII Counts. Vertical axis: Number of MIIs; series: maxc==6, maxc==7, maxc==8.]

C. T10I4D100K Data

The T10I4D100K dataset, generated by the software described in [12], contains 100,000 transactions and 870 items in its inventory. It has an average number of 10 items per transaction and an average support of 4 for each item.

We ran trials with no limit on the cardinality of the MIIs, varying the threshold from 1 to 1. It is interesting to see that the computation time (Figure 4) drops faster than the number of MIIs (Figure 5).

[Fig. 4. T10I4D100K Dataset Computation Time. Vertical axis: Seconds.]

D. Connect4 Data

The Connect-4 dataset contains all legal 8-ply positions in the game of Connect-4 in which neither player has won yet, and in which the next move is not forced.
There are [...] transactions and 43 columns (one for each of the 42 Connect-4 squares together with an outcome column: win, draw, or lose). Once this dataset is transformed into a binary format it contains 129 items, since each cell in the original dataset holds one of three possible values. This dataset presents the most computational challenge to MINIT. We imposed a cardinality limit of maxc == 6.

The number of MIIs (Figure 7) has a substantially slower decline than the numbers for the previous datasets. Although the running time also declines more slowly, the decline is even slower than the decline of the MII counts.

[Fig. 5. T10I4D100K Dataset MII Counts. Vertical axis: Number of MinIIs.]

[Fig. 6. Connect4 Dataset Computation Time. Vertical axis: Seconds.]

[Fig. 7. Connect4 Dataset MII Counts. Vertical axis: Number of MIIs.]

IX. CONCLUSIONS

We present a new algorithm, MINIT, for finding minimal τ-infrequent or minimal τ-occurrent itemsets. The computation time required on the four datasets presented suggests a correlation between the number of MIIs and the amount of computation time required.

It would be interesting to see how well MINIT could run in a parallel or grid environment. It would also be useful to find other pruning strategies to improve the running time requirements.

ACKNOWLEDGMENT

This work was supported in part by the National Science Foundation under grant CTS [...].

REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," in Proceedings of the 1993 International Conference on Management of Data (SIGMOD '93), May 1993.
[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo, "Fast discovery of association rules," in Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. Menlo Park: The AAAI Press, 1996.
[3] S. Brin, R. Motwani, J. Ullman, and S. Tsur, "Dynamic itemset counting and implication rules for market basket data," in Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM Press, 1997.
[4] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, "New algorithms for fast discovery of association rules," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1997.
[5] E. Boros, V. Gurvich, L. Khachiyan, and K. Makino, "On the complexity of generating maximal frequent and minimal infrequent sets," in Symposium on Theoretical Aspects of Computer Science, 2002. [Online]. Available: citeseer.ist.psu.edu/boros2complexity.html
[6] D. Gunopulos, R. Khardon, H. Mannila, S. Saluja, H. Toivonen, and R. S. Sharma, "Discovering all most specific sentences," ACM Trans. Database Syst., vol. 28, no. 2, 2003.
[7] G. Yang, "Computational aspects of mining maximal frequent patterns," Theoretical Computer Science, vol. 362, 2006.
[8] A. M. Manning and D. J. Haglin, "A new algorithm for finding minimal sample uniques for use in statistical disclosure assessment," in IEEE International Conference on Data Mining (ICDM'05), Nov. 2005.
[9] A. M. Manning, D. J. Haglin, and J. A. Keane, "A recursive search algorithm for statistical disclosure assessment," Data Mining and Knowledge Discovery, 2007, conditionally accepted.
[10] A. Takemura, "Minimum unsafe and maximum safe sets of variables for disclosure risk assessment of individual records in a microdata set," Journal of the Japan Statistical Society, vol. 32, no. 1, 2002.
[11] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Co.
[12] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules in large databases," in VLDB '94: Proceedings of the 20th International Conference on Very Large Data Bases. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1994.


More information

Data Mining and Analysis: Fundamental Concepts and Algorithms

Data Mining and Analysis: Fundamental Concepts and Algorithms Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA

More information

Guaranteeing the Accuracy of Association Rules by Statistical Significance

Guaranteeing the Accuracy of Association Rules by Statistical Significance Guaranteeing the Accuracy of Association Rules by Statistical Significance W. Hämäläinen Department of Computer Science, University of Helsinki, Finland Abstract. Association rules are a popular knowledge

More information

Mining Molecular Fragments: Finding Relevant Substructures of Molecules

Mining Molecular Fragments: Finding Relevant Substructures of Molecules Mining Molecular Fragments: Finding Relevant Substructures of Molecules Christian Borgelt, Michael R. Berthold Proc. IEEE International Conference on Data Mining, 2002. ICDM 2002. Lecturers: Carlo Cagli

More information

Free-sets : a Condensed Representation of Boolean Data for the Approximation of Frequency Queries

Free-sets : a Condensed Representation of Boolean Data for the Approximation of Frequency Queries Free-sets : a Condensed Representation of Boolean Data for the Approximation of Frequency Queries To appear in Data Mining and Knowledge Discovery, an International Journal c Kluwer Academic Publishers

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6 Data Mining: Concepts and Techniques (3 rd ed.) Chapter 6 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2013 Han, Kamber & Pei. All rights

More information

A New Concise and Lossless Representation of Frequent Itemsets Using Generators and A Positive Border

A New Concise and Lossless Representation of Frequent Itemsets Using Generators and A Positive Border A New Concise and Lossless Representation of Frequent Itemsets Using Generators and A Positive Border Guimei Liu a,b Jinyan Li a Limsoon Wong b a Institute for Infocomm Research, Singapore b School of

More information

Transaction Databases, Frequent Itemsets, and Their Condensed Representations

Transaction Databases, Frequent Itemsets, and Their Condensed Representations Transaction Databases, Frequent Itemsets, and Their Condensed Representations Taneli Mielikäinen HIIT Basic Research Unit Department of Computer Science University of Helsinki, Finland Abstract. Mining

More information

Bottom-Up Propositionalization

Bottom-Up Propositionalization Bottom-Up Propositionalization Stefan Kramer 1 and Eibe Frank 2 1 Institute for Computer Science, Machine Learning Lab University Freiburg, Am Flughafen 17, D-79110 Freiburg/Br. skramer@informatik.uni-freiburg.de

More information

Free-Sets: A Condensed Representation of Boolean Data for the Approximation of Frequency Queries

Free-Sets: A Condensed Representation of Boolean Data for the Approximation of Frequency Queries Data Mining and Knowledge Discovery, 7, 5 22, 2003 c 2003 Kluwer Academic Publishers. Manufactured in The Netherlands. Free-Sets: A Condensed Representation of Boolean Data for the Approximation of Frequency

More information

Quantitative Association Rule Mining on Weighted Transactional Data

Quantitative Association Rule Mining on Weighted Transactional Data Quantitative Association Rule Mining on Weighted Transactional Data D. Sujatha and Naveen C. H. Abstract In this paper we have proposed an approach for mining quantitative association rules. The aim of

More information

Self-duality of bounded monotone boolean functions and related problems

Self-duality of bounded monotone boolean functions and related problems Discrete Applied Mathematics 156 (2008) 1598 1605 www.elsevier.com/locate/dam Self-duality of bounded monotone boolean functions and related problems Daya Ram Gaur a, Ramesh Krishnamurti b a Department

More information

Mining Positive and Negative Fuzzy Association Rules

Mining Positive and Negative Fuzzy Association Rules Mining Positive and Negative Fuzzy Association Rules Peng Yan 1, Guoqing Chen 1, Chris Cornelis 2, Martine De Cock 2, and Etienne Kerre 2 1 School of Economics and Management, Tsinghua University, Beijing

More information

Theory of Dependence Values

Theory of Dependence Values Theory of Dependence Values ROSA MEO Università degli Studi di Torino A new model to evaluate dependencies in data mining problems is presented and discussed. The well-known concept of the association

More information

An Efficient Implementation of a Joint Generation Algorithm

An Efficient Implementation of a Joint Generation Algorithm An Efficient Implementation of a Joint Generation Algorithm E. Boros 1, K. Elbassioni 1, V. Gurvich 1, and L. Khachiyan 2 1 RUTCOR, Rutgers University, 640 Bartholomew Road, Piscataway, NJ 08854-8003,

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Mining Frequent Patterns and Associations: Basic Concepts (Chapter 6) Huan Sun, CSE@The Ohio State University 10/17/2017 Slides adapted from Prof. Jiawei Han @UIUC, Prof.

More information

Maintaining Frequent Itemsets over High-Speed Data Streams

Maintaining Frequent Itemsets over High-Speed Data Streams Maintaining Frequent Itemsets over High-Speed Data Streams James Cheng, Yiping Ke, and Wilfred Ng Department of Computer Science Hong Kong University of Science and Technology Clear Water Bay, Kowloon,

More information

Mining Approximative Descriptions of Sets Using Rough Sets

Mining Approximative Descriptions of Sets Using Rough Sets Mining Approximative Descriptions of Sets Using Rough Sets Dan A. Simovici University of Massachusetts Boston, Dept. of Computer Science, 100 Morrissey Blvd. Boston, Massachusetts, 02125 USA dsim@cs.umb.edu

More information

Association Rules. Acknowledgements. Some parts of these slides are modified from. n C. Clifton & W. Aref, Purdue University

Association Rules. Acknowledgements. Some parts of these slides are modified from. n C. Clifton & W. Aref, Purdue University Association Rules CS 5331 by Rattikorn Hewett Texas Tech University 1 Acknowledgements Some parts of these slides are modified from n C. Clifton & W. Aref, Purdue University 2 1 Outline n Association Rule

More information

D B M G Data Base and Data Mining Group of Politecnico di Torino

D B M G Data Base and Data Mining Group of Politecnico di Torino Data Base and Data Mining Group of Politecnico di Torino Politecnico di Torino Association rules Objective extraction of frequent correlations or pattern from a transactional database Tickets at a supermarket

More information

Frequent Itemset Mining

Frequent Itemset Mining ì 1 Frequent Itemset Mining Nadjib LAZAAR LIRMM- UM COCONUT Team (PART I) IMAGINA 17/18 Webpage: http://www.lirmm.fr/~lazaar/teaching.html Email: lazaar@lirmm.fr 2 Data Mining ì Data Mining (DM) or Knowledge

More information

Removing trivial associations in association rule discovery

Removing trivial associations in association rule discovery Removing trivial associations in association rule discovery Geoffrey I. Webb and Songmao Zhang School of Computing and Mathematics, Deakin University Geelong, Victoria 3217, Australia Abstract Association

More information

Association Rules. Fundamentals

Association Rules. Fundamentals Politecnico di Torino Politecnico di Torino 1 Association rules Objective extraction of frequent correlations or pattern from a transactional database Tickets at a supermarket counter Association rule

More information

D B M G. Association Rules. Fundamentals. Fundamentals. Elena Baralis, Silvia Chiusano. Politecnico di Torino 1. Definitions.

D B M G. Association Rules. Fundamentals. Fundamentals. Elena Baralis, Silvia Chiusano. Politecnico di Torino 1. Definitions. Definitions Data Base and Data Mining Group of Politecnico di Torino Politecnico di Torino Itemset is a set including one or more items Example: {Beer, Diapers} k-itemset is an itemset that contains k

More information

D B M G. Association Rules. Fundamentals. Fundamentals. Association rules. Association rule mining. Definitions. Rule quality metrics: example

D B M G. Association Rules. Fundamentals. Fundamentals. Association rules. Association rule mining. Definitions. Rule quality metrics: example Association rules Data Base and Data Mining Group of Politecnico di Torino Politecnico di Torino Objective extraction of frequent correlations or pattern from a transactional database Tickets at a supermarket

More information

DATA MINING LECTURE 3. Frequent Itemsets Association Rules

DATA MINING LECTURE 3. Frequent Itemsets Association Rules DATA MINING LECTURE 3 Frequent Itemsets Association Rules This is how it all started Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases.

More information

Summarizing Transactional Databases with Overlapped Hyperrectangles

Summarizing Transactional Databases with Overlapped Hyperrectangles Noname manuscript No. (will be inserted by the editor) Summarizing Transactional Databases with Overlapped Hyperrectangles Yang Xiang Ruoming Jin David Fuhry Feodor F. Dragan Abstract Transactional data

More information

Approximating a Collection of Frequent Sets

Approximating a Collection of Frequent Sets Approximating a Collection of Frequent Sets Foto Afrati National Technical University of Athens Greece afrati@softlab.ece.ntua.gr Aristides Gionis HIIT Basic Research Unit Dept. of Computer Science University

More information

Generating Partial and Multiple Transversals of a Hypergraph

Generating Partial and Multiple Transversals of a Hypergraph Generating Partial and Multiple Transversals of a Hypergraph Endre Boros 1, Vladimir Gurvich 2, Leonid Khachiyan 3, and Kazuhisa Makino 4 1 RUTCOR, Rutgers University, 640 Bartholomew Road, Piscataway

More information

Data Mining Concepts & Techniques

Data Mining Concepts & Techniques Data Mining Concepts & Techniques Lecture No. 04 Association Analysis Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro

More information

Reductions for Frequency-Based Data Mining Problems

Reductions for Frequency-Based Data Mining Problems Reductions for Frequency-Based Data Mining Problems Stefan Neumann University of Vienna Vienna, Austria Email: stefan.neumann@univie.ac.at Pauli Miettinen Max Planck Institute for Informatics Saarland

More information

FUZZY ASSOCIATION RULES: A TWO-SIDED APPROACH

FUZZY ASSOCIATION RULES: A TWO-SIDED APPROACH FUZZY ASSOCIATION RULES: A TWO-SIDED APPROACH M. De Cock C. Cornelis E. E. Kerre Dept. of Applied Mathematics and Computer Science Ghent University, Krijgslaan 281 (S9), B-9000 Gent, Belgium phone: +32

More information

Modified Entropy Measure for Detection of Association Rules Under Simpson's Paradox Context.

Modified Entropy Measure for Detection of Association Rules Under Simpson's Paradox Context. Modified Entropy Measure for Detection of Association Rules Under Simpson's Paradox Context. Murphy Choy Cally Claire Ong Michelle Cheong Abstract The rapid explosion in retail data calls for more effective

More information

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany Syllabus Fri. 21.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 27.10. (2) A.1 Linear Regression Fri. 3.11. (3) A.2 Linear Classification Fri. 10.11. (4) A.3 Regularization

More information

Outline. Fast Algorithms for Mining Association Rules. Applications of Data Mining. Data Mining. Association Rule. Discussion

Outline. Fast Algorithms for Mining Association Rules. Applications of Data Mining. Data Mining. Association Rule. Discussion Outline Fast Algorithms for Mining Association Rules Rakesh Agrawal Ramakrishnan Srikant Introduction Algorithm Apriori Algorithm AprioriTid Comparison of Algorithms Conclusion Presenter: Dan Li Discussion:

More information

Dual-Bounded Generating Problems: Weighted Transversals of a Hypergraph

Dual-Bounded Generating Problems: Weighted Transversals of a Hypergraph Dual-Bounded Generating Problems: Weighted Transversals of a Hypergraph E. Boros V. Gurvich L. Khachiyan K. Makino January 15, 2003 Abstract We consider a generalization of the notion of transversal to

More information

Pattern Space Maintenance for Data Updates. and Interactive Mining

Pattern Space Maintenance for Data Updates. and Interactive Mining Pattern Space Maintenance for Data Updates and Interactive Mining Mengling Feng, 1,3,4 Guozhu Dong, 2 Jinyan Li, 1 Yap-Peng Tan, 1 Limsoon Wong 3 1 Nanyang Technological University, 2 Wright State University

More information

Regression and Correlation Analysis of Different Interestingness Measures for Mining Association Rules

Regression and Correlation Analysis of Different Interestingness Measures for Mining Association Rules International Journal of Innovative Research in Computer Scien & Technology (IJIRCST) Regression and Correlation Analysis of Different Interestingness Measures for Mining Association Rules Mir Md Jahangir

More information

Lecture 5: Efficient PAC Learning. 1 Consistent Learning: a Bound on Sample Complexity

Lecture 5: Efficient PAC Learning. 1 Consistent Learning: a Bound on Sample Complexity Universität zu Lübeck Institut für Theoretische Informatik Lecture notes on Knowledge-Based and Learning Systems by Maciej Liśkiewicz Lecture 5: Efficient PAC Learning 1 Consistent Learning: a Bound on

More information

An Approach to Classification Based on Fuzzy Association Rules

An Approach to Classification Based on Fuzzy Association Rules An Approach to Classification Based on Fuzzy Association Rules Zuoliang Chen, Guoqing Chen School of Economics and Management, Tsinghua University, Beijing 100084, P. R. China Abstract Classification based

More information

Realization Plans for Extensive Form Games without Perfect Recall

Realization Plans for Extensive Form Games without Perfect Recall Realization Plans for Extensive Form Games without Perfect Recall Richard E. Stearns Department of Computer Science University at Albany - SUNY Albany, NY 12222 April 13, 2015 Abstract Given a game in

More information

A Novel Dencos Model For High Dimensional Data Using Genetic Algorithms

A Novel Dencos Model For High Dimensional Data Using Genetic Algorithms A Novel Dencos Model For High Dimensional Data Using Genetic Algorithms T. Vijayakumar 1, V.Nivedhitha 2, K.Deeba 3 and M. Sathya Bama 4 1 Assistant professor / Dept of IT, Dr.N.G.P College of Engineering

More information

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval Sargur Srihari University at Buffalo The State University of New York 1 A Priori Algorithm for Association Rule Learning Association

More information

On Generating All Minimal Integer Solutions for a Monotone System of Linear Inequalities

On Generating All Minimal Integer Solutions for a Monotone System of Linear Inequalities On Generating All Minimal Integer Solutions for a Monotone System of Linear Inequalities E. Boros 1, K. Elbassioni 2, V. Gurvich 1, L. Khachiyan 2, and K. Makino 3 1 RUTCOR, Rutgers University, 640 Bartholomew

More information

An Incremental RNC Algorithm for Generating All Maximal Independent Sets in Hypergraphs of Bounded Dimension

An Incremental RNC Algorithm for Generating All Maximal Independent Sets in Hypergraphs of Bounded Dimension An Incremental RNC Algorithm for Generating All Maximal Independent Sets in Hypergraphs of Bounded Dimension E. Boros K. Elbassioni V. Gurvich L. Khachiyan Abstract We show that for hypergraphs of bounded

More information

Selecting a Right Interestingness Measure for Rare Association Rules

Selecting a Right Interestingness Measure for Rare Association Rules Selecting a Right Interestingness Measure for Rare Association Rules Akshat Surana R. Uday Kiran P. Krishna Reddy Center for Data Engineering International Institute of Information Technology-Hyderabad

More information

Mining Rank Data. Sascha Henzgen and Eyke Hüllermeier. Department of Computer Science University of Paderborn, Germany

Mining Rank Data. Sascha Henzgen and Eyke Hüllermeier. Department of Computer Science University of Paderborn, Germany Mining Rank Data Sascha Henzgen and Eyke Hüllermeier Department of Computer Science University of Paderborn, Germany {sascha.henzgen,eyke}@upb.de Abstract. This paper addresses the problem of mining rank

More information

Editorial Manager(tm) for Data Mining and Knowledge Discovery Manuscript Draft

Editorial Manager(tm) for Data Mining and Knowledge Discovery Manuscript Draft Editorial Manager(tm) for Data Mining and Knowledge Discovery Manuscript Draft Manuscript Number: Title: Summarizing transactional databases with overlapped hyperrectangles, theories and algorithms Article

More information

Machine Learning: Pattern Mining

Machine Learning: Pattern Mining Machine Learning: Pattern Mining Information Systems and Machine Learning Lab (ISMLL) University of Hildesheim Wintersemester 2007 / 2008 Pattern Mining Overview Itemsets Task Naive Algorithm Apriori Algorithm

More information

COMP 5331: Knowledge Discovery and Data Mining

COMP 5331: Knowledge Discovery and Data Mining COMP 5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Jiawei Han, Micheline Kamber, and Jian Pei And slides provide by Raymond

More information

About the relationship between formal logic and complexity classes

About the relationship between formal logic and complexity classes About the relationship between formal logic and complexity classes Working paper Comments welcome; my email: armandobcm@yahoo.com Armando B. Matos October 20, 2013 1 Introduction We analyze a particular

More information

Introduction to Complexity Theory

Introduction to Complexity Theory Introduction to Complexity Theory Read K & S Chapter 6. Most computational problems you will face your life are solvable (decidable). We have yet to address whether a problem is easy or hard. Complexity

More information

Constraint-Based Rule Mining in Large, Dense Databases

Constraint-Based Rule Mining in Large, Dense Databases Appears in Proc of the 5th Int l Conf on Data Engineering, 88-97, 999 Constraint-Based Rule Mining in Large, Dense Databases Roberto J Bayardo Jr IBM Almaden Research Center bayardo@alummitedu Rakesh Agrawal

More information

Mining Free Itemsets under Constraints

Mining Free Itemsets under Constraints Mining Free Itemsets under Constraints Jean-François Boulicaut Baptiste Jeudy Institut National des Sciences Appliquées de Lyon Laboratoire d Ingénierie des Systèmes d Information Bâtiment 501 F-69621

More information

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05 Meelis Kull meelis.kull@ut.ee Autumn 2017 1 Sample vs population Example task with red and black cards Statistical terminology Permutation test and hypergeometric test Histogram on a sample vs population

More information

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit:

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit: Frequent Itemsets and Association Rule Mining Vinay Setty vinay.j.setty@uis.no Slides credit: http://www.mmds.org/ Association Rule Discovery Supermarket shelf management Market-basket model: Goal: Identify

More information

Discovering Non-Redundant Association Rules using MinMax Approximation Rules

Discovering Non-Redundant Association Rules using MinMax Approximation Rules Discovering Non-Redundant Association Rules using MinMax Approximation Rules R. Vijaya Prakash Department Of Informatics Kakatiya University, Warangal, India vijprak@hotmail.com Dr.A. Govardhan Department.

More information

Introduction. An Introduction to Algorithms and Data Structures

Introduction. An Introduction to Algorithms and Data Structures Introduction An Introduction to Algorithms and Data Structures Overview Aims This course is an introduction to the design, analysis and wide variety of algorithms (a topic often called Algorithmics ).

More information

An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases

An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno, Tatsuya Asai 2 3, Yuzo Uchida 2, and Hiroki Arimura 2 National Institute of Informatics, 2--2, Hitotsubashi,

More information

A Global Constraint for Closed Frequent Pattern Mining

A Global Constraint for Closed Frequent Pattern Mining A Global Constraint for Closed Frequent Pattern Mining N. Lazaar 1, Y. Lebbah 2, S. Loudni 3, M. Maamar 1,2, V. Lemière 3, C. Bessiere 1, P. Boizumault 3 1 LIRMM, University of Montpellier, France 2 LITIO,

More information

Efficient discovery of statistically significant association rules

Efficient discovery of statistically significant association rules fficient discovery of statistically significant association rules Wilhelmiina Hämäläinen Department of Computer Science University of Helsinki Finland whamalai@cs.helsinki.fi Matti Nykänen Department of

More information

Effective Elimination of Redundant Association Rules

Effective Elimination of Redundant Association Rules Effective Elimination of Redundant Association Rules James Cheng Yiping Ke Wilfred Ng Department of Computer Science and Engineering The Hong Kong University of Science and Technology Clear Water Bay,

More information

Pushing Tougher Constraints in Frequent Pattern Mining

Pushing Tougher Constraints in Frequent Pattern Mining Pushing Tougher Constraints in Frequent Pattern Mining Francesco Bonchi 1 and Claudio Lucchese 2 1 Pisa KDD Laboratory, ISTI - C.N.R., Area della Ricerca di Pisa, Italy 2 Department of Computer Science,

More information

Levelwise Search and Borders of Theories in Knowledge Discovery

Levelwise Search and Borders of Theories in Knowledge Discovery Data Mining and Knowledge Discovery 1, 241 258 (1997) c 1997 Kluwer Academic Publishers. Manufactured in The Netherlands. Levelwise Search and Borders of Theories in Knowledge Discovery HEIKKI MANNILA

More information

A Study on Monotone Self-Dual Boolean Functions

A Study on Monotone Self-Dual Boolean Functions A Study on Monotone Self-Dual Boolean Functions Mustafa Altun a and Marc D Riedel b a Electronics and Communication Engineering, Istanbul Technical University, Istanbul, Turkey 34469 b Electrical and Computer

More information

Boolean Analyzer - An Algorithm That Uses A Probabilistic Interestingness Measure to find Dependency/Association Rules In A Head Trauma Data

Boolean Analyzer - An Algorithm That Uses A Probabilistic Interestingness Measure to find Dependency/Association Rules In A Head Trauma Data Boolean Analyzer - An Algorithm That Uses A Probabilistic Interestingness Measure to find Dependency/Association Rules In A Head Trauma Data Susan P. Imberman a, Bernard Domanski b, Hilary W. Thompson

More information

On Differentially Private Frequent Itemsets Mining

On Differentially Private Frequent Itemsets Mining On Differentially Private Frequent Itemsets Mining Chen Zeng University of Wisconsin-Madison zeng@cs.wisc.edu Jeffrey F. Naughton University of Wisconsin-Madison naughton@cs.wisc.edu Jin-Yi Cai University

More information

EFFICIENT MINING OF WEIGHTED QUANTITATIVE ASSOCIATION RULES AND CHARACTERIZATION OF FREQUENT ITEMSETS

EFFICIENT MINING OF WEIGHTED QUANTITATIVE ASSOCIATION RULES AND CHARACTERIZATION OF FREQUENT ITEMSETS EFFICIENT MINING OF WEIGHTED QUANTITATIVE ASSOCIATION RULES AND CHARACTERIZATION OF FREQUENT ITEMSETS Arumugam G Senior Professor and Head, Department of Computer Science Madurai Kamaraj University Madurai,

More information

FARMER: Finding Interesting Rule Groups in Microarray Datasets

FARMER: Finding Interesting Rule Groups in Microarray Datasets FARMER: Finding Interesting Rule Groups in Microarray Datasets Gao Cong, Anthony K. H. Tung, Xin Xu, Feng Pan Dept. of Computer Science Natl. University of Singapore {conggao,atung,xuxin,panfeng}@comp.nus.edu.sg

More information

DATA MINING LECTURE 4. Frequent Itemsets, Association Rules Evaluation Alternative Algorithms

DATA MINING LECTURE 4. Frequent Itemsets, Association Rules Evaluation Alternative Algorithms DATA MINING LECTURE 4 Frequent Itemsets, Association Rules Evaluation Alternative Algorithms RECAP Mining Frequent Itemsets Itemset A collection of one or more items Example: {Milk, Bread, Diaper} k-itemset

More information

Dynamic Programming Approach for Construction of Association Rule Systems

Dynamic Programming Approach for Construction of Association Rule Systems Dynamic Programming Approach for Construction of Association Rule Systems Fawaz Alsolami 1, Talha Amin 1, Igor Chikalov 1, Mikhail Moshkov 1, and Beata Zielosko 2 1 Computer, Electrical and Mathematical

More information

Computer Science 385 Analysis of Algorithms Siena College Spring Topic Notes: Limitations of Algorithms

Computer Science 385 Analysis of Algorithms Siena College Spring Topic Notes: Limitations of Algorithms Computer Science 385 Analysis of Algorithms Siena College Spring 2011 Topic Notes: Limitations of Algorithms We conclude with a discussion of the limitations of the power of algorithms. That is, what kinds

More information

Discovery of Frequent Word Sequences in Text. Helena Ahonen-Myka. University of Helsinki

Discovery of Frequent Word Sequences in Text. Helena Ahonen-Myka. University of Helsinki Discovery of Frequent Word Sequences in Text Helena Ahonen-Myka University of Helsinki Department of Computer Science P.O. Box 26 (Teollisuuskatu 23) FIN{00014 University of Helsinki, Finland, helena.ahonen-myka@cs.helsinki.fi

More information

Associa'on Rule Mining

Associa'on Rule Mining Associa'on Rule Mining Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata August 4 and 7, 2014 1 Market Basket Analysis Scenario: customers shopping at a supermarket Transaction

More information

Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules. M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D.

Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules. M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D. Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D. Sánchez 18th September 2013 Detecting Anom and Exc Behaviour on

More information

Data Mining and Matrices

Data Mining and Matrices Data Mining and Matrices 08 Boolean Matrix Factorization Rainer Gemulla, Pauli Miettinen June 13, 2013 Outline 1 Warm-Up 2 What is BMF 3 BMF vs. other three-letter abbreviations 4 Binary matrices, tiles,

More information

The Complexity of Mining Maximal Frequent Subgraphs
