Mining Class-Dependent Rules Using the Concept of Generalization/Specialization Hierarchies


Juliano Brito da Justa Neves¹, Marina Teresa Pires Vieira
{juliano,marina}@dc.ufscar.br
Computer Science Department, Federal University of São Carlos (UFSCar), Caixa Postal 676, São Carlos, SP, Brazil

¹ MPhil scholarship CAPES/Brazil

Abstract

Data mining is the process of discovering useful and previously unknown patterns in large datasets. Discovering association rules between items in large datasets is one such data mining task. A discovered rule is considered interesting if it brings new and useful information about the data. In this paper we show that if a dataset can be divided into subclasses and analyzed as a generalization/specialization hierarchy, the class/subclass relationship can lead to the discovery of class-dependent rules, which expose interesting differences between the behavior of the whole dataset and the behavior of each of its subclasses. These differences are extracted from the association rules mined from the whole dataset and from each subclass separately. Using the concept of generalization/specialization hierarchies, comparisons and tests of statistical hypotheses, we present methods that efficiently discover class-dependent rules and the classes that have the strongest influence on each rule. The methods also show when a given stimulus can provoke in one class a response that would not be expected if only the undivided dataset were analyzed.

Keywords: Data mining; Generalization/specialization hierarchies; Class-dependent rules; Interesting rules

1 Introduction

One of the key aspects of data mining is that the discovered knowledge should be interesting, the term interesting meaning that the discovered knowledge is useful and previously unknown. One of the most studied data mining tasks is association rule mining, first introduced in [1]. Association rules identify items that are likely to occur together in a significant fraction of the transactions in a dataset. Every rule must satisfy two user-specified constraints: support, which measures its statistical significance, and confidence, which measures its strength. Other interest measures besides support and confidence are also used to discover even more interesting association rules.

The objective of this paper is not to propose new methods for mining interesting association rules. Using the concept of class/subclass hierarchies (generalization/specialization hierarchies), comparisons and tests of statistical hypotheses, we present methods for discovering interesting knowledge that is not the focus of current data mining techniques. Looking for differences between the association rules discovered at the different levels of the class/subclass hierarchy, we show in this paper how to identify class-dependent rules, in order to state whether a given behavior of the entire population under study occurs in the same manner in all its subclasses and whether a given stimulus can lead to different class-dependent responses.

For instance, suppose a hospital database in which the rule aspirin → improved_health_status holds. It is interesting to know whether, for a given subclass of the whole dataset, such as pregnant women over 40 years old, the same rule holds with support and confidence values so different that it can be concluded that this subclass does not follow the behavior of the whole dataset with respect to this rule. It is also interesting to know whether, for a given subclass, there are rules with the same stimulus (aspirin) having different responses that were not expected for the whole dataset, such as aspirin → lower_heart_beat_rate, which may hold only for the pregnant women over 40 years old and not for the whole dataset, making explicit that this subclass has a class-dependent response to aspirin.

In the next section we review the necessary background for the association rule problem and present some current work on interest measures. In section 3 we present the methods and theory for mining class-dependent rules using the concept of generalization/specialization hierarchies and tests of statistical hypotheses. In section 4 we present examples and experiments, and in section 5 our conclusions and future work.

2 Background

The association rule problem can be formally stated as follows [1, 2]: Let I = {i_1, i_2, ..., i_m} be a set of distinct attributes, also called items. Let D be a dataset of transactions, where each transaction contains a set of items {i_i, i_j, ..., i_k} ⊆ I. An association rule has the form X → Y, where X, Y ⊂ I and X ∩ Y = Ø. X is called the antecedent and Y the consequent of the rule. A set of items is called an itemset, and for each nonnegative integer k an itemset with k items is called a k-itemset. The support of an itemset is the percentage of transactions in D in which the itemset occurs as a subset of the transaction. The confidence of a rule X → Y is defined as the ratio support(X ∪ Y)/support(X), and represents the percentage of all transactions containing X that also contain Y.

The task of mining association rules can be decomposed into two subtasks:

1. Generate all itemsets with support above a user-specified support threshold. These itemsets are called large itemsets; all others are called small itemsets.
2. For each large itemset, generate all the rules with confidence above a user-specified confidence threshold as follows: for a large itemset X and any Y ⊂ X, if support(X)/support(X−Y) ≥ minimum confidence, then the rule X−Y → Y is a valid rule.

The support and confidence thresholds were the first interest measures used in the task of mining association rules. Current work deals with discovering association rules using other interest measures besides support and confidence. In [3], association rules that identify correlations are mined, and both the presence and the absence of items are used as a basis for generating rules; the chi-squared test from classical statistics is used as the measure of significance of associations. In [4], support is still used as an interest measure but, instead of confidence, a metric called conviction is used, which measures implication rather than mere co-occurrence. In [5], the interest measures entropy gain (mutual information), gini index (mean squared error) and chi-squared (correlation) are used to find association rules that segment large categorical datasets into two parts that are optimal according to some objective function.
In [6], a support/confidence border is used to discover association rules that satisfy a number of interest measures, such as support, confidence, entropy, chi-squared and conviction. In [7], confidence-based interest measures named any-confidence, all-confidence and bond are used to discover interesting association rules, and [8] presents a survey of other interest measures.
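As a small, self-contained illustration of the support and confidence measures defined above (the helper functions and the toy transactions are our own, not part of the paper), these quantities can be computed directly from a list of transactions, for instance in Python:

    # Illustrative sketch of the support/confidence definitions from Section 2.
    def support(itemset, transactions):
        """Fraction of transactions that contain every item of `itemset`."""
        itemset = set(itemset)
        return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

    def confidence(antecedent, consequent, transactions):
        """support(X U Y) / support(X) for the rule X -> Y."""
        x, y = set(antecedent), set(consequent)
        return support(x | y, transactions) / support(x, transactions)

    transactions = [
        {"bread", "milk"},
        {"bread", "beer", "milk"},
        {"beer", "diapers"},
        {"bread", "milk", "diapers"},
    ]
    print(support({"bread", "milk"}, transactions))       # 0.75
    print(confidence({"bread"}, {"milk"}, transactions))   # 1.0

A rule mining algorithm such as Apriori [2] essentially enumerates the large itemsets under the first measure and then filters candidate rules with the second.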

Most of the current work on interest measures aims at discovering interesting rules by analyzing the relationship between the items that compose each rule, which vary from rule to rule, and the whole dataset of transactions, which is the same for all rules. Our work presents a different approach, in which the same rule can be found in different datasets (class/subclass) of transactions, and the differences in the relationship between the rule and each dataset in which it occurs are used as a measure of interest.

3 Mining Class-Dependent Rules

Our approach can be described as follows. Let D be a dataset of transactions. According to some user-defined criterion, D can be divided into k subclasses D_i, where D_1 ∪ D_2 ∪ ... ∪ D_k = D and D_i ∩ D_j = Ø for i ≠ j. Using the concept of generalization/specialization hierarchies, D is said to be specialized into k subclasses or, reciprocally, D_1, D_2, ..., D_k are said to be generalized into the class D, as shown in Figure 1. This hierarchy is disjoint and total, so each transaction in D is found in exactly one subclass and all association rules from each subclass are found in D, possibly with different support and confidence values. If D already presents a generalization/specialization hierarchy, the proposed approach naturally adapts itself to it.

Figure 1 - Generalization/Specialization Hierarchy

Let r be the association rule X → Y. Analyzing r with respect to D, r has support s_D, where s_D is the number of transactions in D in which r holds, and confidence c_D, where c_D is the number of transactions in D which contain X and Y divided by the number of transactions in D which contain X. Backed by the concept of generalization/specialization hierarchies, it can be stated that r occurs in all k subclasses of D, with support s_Di, where s_Di is the number of transactions in D_i in which r holds and s_D1 + s_D2 + ... + s_Dk = s_D, and confidence c_Di, where c_Di is the number of transactions in D_i which contain X and Y divided by the number of transactions in D_i which contain X. Let n_D be the total number of transactions in D and n_Di be the total number of transactions in each D_i, so that n_D1 + n_D2 + ... + n_Dk = n_D. The two kinds of class-dependent rules are defined next.

Definition 1: (Same Stimulus, Same Response, Different Support and/or Confidence Values) The rule r is class-dependent and, therefore, interesting, if r holds for D and if the ratio s_D/n_D is significantly different from s_Di/n_Di and/or c_Di is significantly different from c_D, for a given subclass D_i, according to tests of statistical hypotheses.
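To make the quantities compared in Definition 1 concrete, the sketch below (illustrative only: the class labels, transactions and helper names are our own, not the paper's implementation) partitions a labelled transaction dataset into disjoint subclasses and reports, for a single rule X → Y, the support ratio and confidence in D and in each D_i. Whether the observed differences are significant is decided by the statistical test introduced in Section 3.1.

    from collections import defaultdict

    def rule_stats(transactions, antecedent, consequent):
        """Return (support ratio, confidence) of the rule X -> Y."""
        x, y = set(antecedent), set(consequent)
        n_x = sum(1 for t in transactions if x <= t)
        n_xy = sum(1 for t in transactions if (x | y) <= t)
        return n_xy / len(transactions), (n_xy / n_x if n_x else 0.0)

    # Each transaction carries a label that acts as the user-defined criterion.
    labelled = [
        ("D1", {"aspirin", "improved_health_status"}),
        ("D1", {"aspirin", "lower_heart_beat_rate"}),
        ("D2", {"aspirin", "improved_health_status"}),
        ("D2", {"improved_health_status"}),
    ]
    whole = [items for _, items in labelled]     # the class D
    subclasses = defaultdict(list)               # disjoint and total partition
    for label, items in labelled:
        subclasses[label].append(items)

    rule = ({"aspirin"}, {"improved_health_status"})
    print("D ", rule_stats(whole, *rule))
    for name in sorted(subclasses):
        print(name, rule_stats(subclasses[name], *rule))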

From statistics, if a population follows a given distribution, a random sample from it will follow the same distribution. For a given rule, if the tests of statistical hypotheses identify differences in the support and/or confidence values, the subclass does not follow the same distribution as the class with respect to this rule. In this case the sample is not random, indicating that there is a dependency between the rule and a given class, so the rule is class-dependent.

Definition 2: (Same Stimulus, Different Responses) The rule X → Y is class-dependent and, therefore, interesting, if for a specified minimum support and confidence:
i. X is frequent in D and in D_i,
ii. X → Y holds for D_i, but does not hold for D.

These rules state that for the same stimulus (the antecedent of the rule) there can be different responses (the consequent of the rule), and those responses are class-dependent, since they occur only in a specific subclass and not in the entire dataset.

Under both definitions, each rule and stimulus must be tested against all subclasses D_i, since it is interesting to know with respect to which subclass a rule is considered class-dependent. In cases with more than two subclasses, it is possible for one subclass to follow the behavior of the whole dataset while the others do not, creating rules that are class-dependent with respect to some subclasses and not class-dependent with respect to others. The tests of statistical hypotheses and the analysis of the same stimulus having different responses are presented in the following sections.

3.1 Problem 1: Same Stimulus, Same Response, Different Support and/or Confidence Values

The problem of mining class-dependent rules can be solved as follows. First, the dataset of transactions and its subclasses are analyzed assuming that they follow a binomial probabilistic model. The binomial model is the most frequently encountered discrete probability model, and it is associated with repeated trials of an experiment that has only two classes of outcomes, such as success or failure [9]. The dataset of transactions is the experiment, and each transaction is a trial of the experiment; the two possible outcomes for each trial are "the rule holds" and "the rule does not hold".

Using any association rule mining algorithm, for instance Apriori [2], FP-growth [10] or ECLAT [11], mine all the association rules that satisfy the user-specified minimum support and confidence from the whole dataset of transactions. For each rule r mined, find the same rule in each subclass of D; notice that the rule may have different support and confidence values in each subclass. With the number of transactions in D and in each subclass, and the support and confidence values for D and for each subclass, use the test of statistical hypotheses for the binomial probabilistic model to check for statistically significant differences in the support and confidence values. The algorithm for mining class-dependent rules is shown in Figure 2; the test of statistical hypotheses and the confidence coefficient α used in the algorithm are explained next.

Algorithm: Class-dependent rules
Input: Dataset D, its subclasses D_i, confidence coefficient α, support s and confidence c
Output: The list of class-dependent rules L_r
1. Using an association rule mining algorithm, mine all rules satisfying s and c from D
2. For each rule r mined
3.   For each subclass D_i
4.     Mine rule r from D_i
5.     Test rule r for support and confidence differences using α
6.     If r is different according to α, add r to L_r
7.   end
8. end
9. return L_r

Figure 2 - Algorithm for finding class-dependent rules

From statistics, in order to test two proportions p1 and p2, two hypotheses must be formulated, H0 (the null hypothesis) and H1 (the alternative hypothesis). The hypotheses for testing whether p1 and p2 are different are:

H0: p1 = p2
H1: p1 ≠ p2

In order to accept the alternative hypothesis it is necessary to show that the null hypothesis does not hold. For this purpose, the following expression is evaluated:

z = (p1 − p2) / sqrt( [p2 (1 − p2) / n] × [(N − n) / (N − 1)] )     (1)

where p1 is the proportion observed in the sample being analyzed, p2 is the same proportion observed in the population from which the sample was extracted, n is the size of the sample and N is the size of the population. If the value of z is less than −Z_α/2 or greater than +Z_α/2, where Z_α/2 depends on the confidence coefficient α, we can reject the null hypothesis and accept the alternative one, as shown in Figure 3.

Figure 3 - Acceptance and rejection areas

It must be noticed that accepting a hypothesis does not mean that it is true; it just means that there is not enough evidence that it is false [9]. Table 1 shows the most frequently used confidence coefficients and their corresponding Z_α/2 values according to [9]:

1 − α     Z_α/2
0.90      1.645
0.95      1.96
0.99      2.576

Table 1 - Frequently used confidence coefficients and corresponding Z_α/2 values [9]
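Expression (1), together with the critical values of Table 1, translates into a few lines of code. The sketch below is a minimal reading of formula (1) (the function names and the dictionary of critical values are our own); it computes z for a subclass proportion against the corresponding proportion in the whole dataset and applies the rejection rule |z| > Z_α/2:

    from math import sqrt

    # Z_{alpha/2} for the confidence coefficients of Table 1 (1 - alpha = 0.90, 0.95, 0.99).
    Z_CRITICAL = {0.10: 1.645, 0.05: 1.96, 0.01: 2.576}

    def z_statistic(p1, p2, n, N):
        """Formula (1): p1 observed in a sample of size n, p2 in the population of size N."""
        return (p1 - p2) / sqrt((p2 * (1.0 - p2) / n) * ((N - n) / (N - 1.0)))

    def differs(p_subclass, p_class, n_subclass, n_class, alpha=0.05):
        """Reject H0 (p1 = p2) when |z| exceeds the critical value for alpha."""
        z = z_statistic(p_subclass, p_class, n_subclass, n_class)
        return abs(z) > Z_CRITICAL[alpha], z

    # Support of rule 1 of the synthetic example in Section 4.1:
    # 45% in D (20 transactions) versus 10% in subclass D1 (10 transactions).
    print(differs(0.10, 0.45, 10, 20))   # (True, approximately -3.07)

Steps 5 and 6 of the algorithm in Figure 2 amount to calling such a test twice per subclass, once with the support values (formula (2)) and once with the confidence values (formula (3)).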

Our hypothesis is that the support and/or the confidence of a given rule in a subclass differ from the support and/or confidence of the same rule in the whole dataset, indicating that the rule behavior depends on the subclass in which it occurs. In order to analyze the differences between support values, we replace p1 in (1) with the support value for D_i and p2 with the support value for D (both expressed as proportions), resulting in

z = (s_Di − s_D) / sqrt( [s_D (1 − s_D) / n] × [(N − n) / (N − 1)] )     (2)

In order to analyze the differences between confidence values, we replace p1 in (1) with the confidence value for D_i and p2 with the confidence value for D, resulting in

z = (c_Di − c_D) / sqrt( [c_D (1 − c_D) / n] × [(N − n) / (N − 1)] )     (3)

where n = n_Di and N = n_D. If a rule is class independent it will occur with the same support and confidence in each subclass, and the tests will accept the null hypothesis for both support and confidence.

3.2 Problem 2: Same Stimulus, Different Responses

The problem of mining association rules showing the same stimulus (the antecedent of the rule) having different responses (the consequent of the rule) depending on the subclass can be solved as follows. First, using an association rule mining algorithm, mine all the association rules that satisfy the user-specified minimum support and confidence from the whole dataset of transactions. For each distinct stimulus X (the antecedent of a rule of the form X → Y) mined from the class, search each subclass for rules X → W, with W ≠ Y, satisfying the following constraints:

For X → W, (number of transactions in D_i in which the rule holds)/(number of transactions in D_i) ≥ minimum support;
For X → W, (number of transactions in D in which the rule holds)/(number of transactions in D) < minimum support;
For X → W, confidence_Di ≥ minimum confidence, where confidence_Di is the confidence with respect to D_i.

These constraints guarantee that the rule X → W is class-dependent, since it does not hold for D but does hold for D_i. If for X → W the ratio (number of transactions in D_i in which the rule holds)/(number of transactions in D_i) is less than the minimum support and/or its confidence is less than the minimum confidence, the rule X → W is not significant enough in this subclass and is therefore not considered. The algorithm for finding the same stimulus having different responses is shown in Figure 4, after the sketch below.
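The three constraints translate almost directly into code. The following sketch (illustrative; the rule representation and helper names are assumptions, not the paper's implementation) checks whether a candidate response W turns the stimulus X into a class-dependent rule X → W with respect to a subclass D_i:

    def holds(antecedent, consequent, transactions, min_support, min_confidence):
        """True if X -> W satisfies both thresholds in `transactions`."""
        x, w = set(antecedent), set(consequent)
        n_x = sum(1 for t in transactions if x <= t)
        n_xw = sum(1 for t in transactions if (x | w) <= t)
        return (n_xw / len(transactions) >= min_support
                and n_x > 0 and n_xw / n_x >= min_confidence)

    def different_response(x, w, d, d_i, min_support, min_confidence):
        """Problem 2: X -> W must hold in the subclass D_i while its support
        in the whole dataset D stays below the minimum support."""
        xw = set(x) | set(w)
        support_in_d = sum(1 for t in d if xw <= t) / len(d)
        return holds(x, w, d_i, min_support, min_confidence) and support_in_d < min_support

This corresponds to the per-rule check carried out in steps 7-9 of the algorithm in Figure 4.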

Algorithm: Find the same stimulus having different responses
Input: Dataset D, its subclasses D_i, support s and confidence c
Output: The list of stimuli and their class-dependent responses L_X,r
1. Using an association rule mining algorithm, mine all rules satisfying s and c from D
2. For each subclass D_i
3.   Mine all rules satisfying s and c from D_i
4. end
5. For each stimulus X mined from D
6.   For each subclass D_i
7.     Find all rules with X as a stimulus
8.     For each rule r found, check if this rule holds for D
9.     If r does not hold for D, add the pair [X, r] to L_X,r
10.  end
11. end
12. Return L_X,r

Figure 4 - Find the same stimulus having different responses

The returned rules are the class-dependent rules for a given stimulus found in D.

4 Examples and Experiments

This section presents an example with synthetic data, in order to show the usage of the proposed methods, and an experiment with a publicly available dataset.

4.1 Analysis of Synthetic Data

Let Table 2 be the dataset of transactions D, containing 20 transactions over the items A to G, and Table 3 the rules mined with minimum support = 20% and minimum confidence = 80%, where s and c are respectively the support and the confidence of the rules.

Table 2 - Dataset of transactions D (20 transactions over the items A, B, C, D, E, F, G)

Table 3 - Rules mined
1) E → B (s = 45.0%, c = 81.8%)
2) E → F (s = 45.0%, c = 81.8%)
3) C → A B (s = 20.0%, c = 100.0%)
4) E G → F (s = 30.0%, c = 85.7%)
5) E G B → F (s = 20.0%, c = 80.0%)
6) G B F → E (s = 20.0%, c = 100.0%)

Suppose that, for a given user criterion, D is divided into two subclasses, D_1 and D_2, with 10 transactions each. Table 4 shows subclass D_1 and Table 5 shows subclass D_2. Despite the example, the subclasses do not need to have the same number of transactions.

Table 4 - Subclass D_1 (10 transactions)
Table 5 - Subclass D_2 (10 transactions)

As an example, the analyses of rules 1 and 2 from Table 3 are presented next, with α = 0.05 (Z_α/2 = 1.96). Rule 1 occurs in subclass D_1 with support = 10.0%. Using formula (2) to test the support values (Problem 1),

z = (0.10 − 0.45) / sqrt( (0.45 × 0.55 / 10) × (20 − 10)/(20 − 1) ) ≈ −3.07

which is less than −1.96. So the null hypothesis is rejected and we can say that rule 1 does not have the same support value in D and in D_1, indicating that D_1 behaves differently from D with respect to rule 1. The same applies to D_2, where the same rule appears with support = 80.0%, resulting in a z value of +3.07, which is greater than +1.96. If the support values had been considered equal by the test of statistical hypotheses, the test regarding confidence would be used.

Testing rule 1 for different responses to the same stimulus (Problem 2), we find in D_1 the rule E → A with support = 30.0% and confidence = 100.0%. This rule does not hold for D, but does hold for D_1. In D_2 there are no rules with E as a stimulus with support and confidence above the specified thresholds, except for E → F, but this rule does hold for D. This test is useful because, analyzing D, we expect the stimulus E to provoke the response B, and it does happen; but for D_1, the stimulus E provokes another class-dependent response that was not expected if only D had been analyzed. Note that we do not have to test rule 2 for different responses to the same stimulus, since its stimulus is equal to one already tested.

For rule 2, subclass D_1 presents the same rule with support = 30.0% and confidence = 100.0%. Using formula (2) to test the support values (Problem 1),

z = (0.30 − 0.45) / sqrt( (0.45 × 0.55 / 10) × (20 − 10)/(20 − 1) ) ≈ −1.31

which allows the acceptance of the null hypothesis for support. Using formula (3) to test the confidence values (Problem 1),

z = (1.00 − 0.818) / sqrt( (0.818 × 0.182 / 10) × (20 − 10)/(20 − 1) ) ≈ +2.06

which allows the rejection of the null hypothesis for confidence.

In subclass D_2 the same rule appears with support = 60.0% and confidence = 75.0%. Using formula (2) to test the support values (Problem 1),

z = (0.60 − 0.45) / sqrt( (0.45 × 0.55 / 10) × (20 − 10)/(20 − 1) ) ≈ +1.31

which allows the acceptance of the null hypothesis for support, and using formula (3) to test the confidence values (Problem 1),

z = (0.75 − 0.818) / sqrt( (0.818 × 0.182 / 10) × (20 − 10)/(20 − 1) ) ≈ −0.77

which allows the acceptance of the null hypothesis for confidence. These results for D_1 and D_2 state that rule 2 is class-dependent regarding confidence only for D_1.

The analyses of rules 1 and 2 can offer relevant information. Considering a real situation, suppose that D is a hospital dataset, each item is a symptom of a given disease, and the mined association rules show relationships among the symptoms. If the hospital staff wishes to reduce the occurrence of symptom B by treating symptom E, it would be interesting to know that the associated symptom B is class-dependent, and that subclass D_2 will benefit more from the treatment than subclass D_1, where the symptom A would also be affected. Also, it would be interesting to know that if symptom E is treated, the occurrence of symptom F would be affected differently in each subclass. To the best of our knowledge, this information cannot be discovered by traditional association rule mining algorithms.

4.2 Analysis of a Publicly Available Dataset

Using the voting-records dataset from [12] as a real example, dividing its 435 transactions into two subclasses, republican (168 transactions) and democrat (267 transactions), using support = 20% and confidence = 80%, and analyzing only the foreign politics votes (el-salvador-aid, aid-to-nicaraguan-contras, immigration, duty-free-exports and export-administration-act-south-africa), in order to reduce the number of discovered association rules to only 25 and to keep the analysis focused on a single subject, we discovered that all 25 association rules mined from the whole dataset are class-dependent according to Definition 1, and only 1 of them is class-dependent according to Definition 2. For instance, the rule aid-to-nicaraguan-contras=yes → el-salvador-aid=no, with support = 46.9% and confidence = 84.3%, is analyzed next. This rule occurs in the republican subclass with support = 4.2% and confidence = 29.2% and in the democrat subclass with support = 73.8% and confidence = 90.4%. Using formula (2) to test the support values (Problem 1):

z_republican = (0.042 − 0.469) / sqrt( (0.469 × 0.531 / 168) × (435 − 168)/(435 − 1) ) ≈ −14.14

z_democrat = (0.738 − 0.469) / sqrt( (0.469 × 0.531 / 267) × (435 − 267)/(435 − 1) ) ≈ +14.16

which allows the rejection of the null hypothesis in both cases.
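For reference, these two values can be reproduced with the test of Section 3.1; the snippet below (illustrative, repeating the z_statistic sketch shown earlier) simply plugs the published support values and subclass sizes into formula (1):

    from math import sqrt

    def z_statistic(p1, p2, n, N):
        # Formula (1); p2 is the support in the whole dataset D (435 transactions).
        return (p1 - p2) / sqrt((p2 * (1.0 - p2) / n) * ((N - n) / (N - 1.0)))

    # aid-to-nicaraguan-contras=yes -> el-salvador-aid=no, support 46.9% in D.
    print(z_statistic(0.042, 0.469, 168, 435))   # republican subclass, about -14.14
    print(z_statistic(0.738, 0.469, 267, 435))   # democrat subclass, about +14.16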

Empirical tests show that for most real datasets it is very difficult for the user to divide the data in a way that makes the rules end up being considered class independent, so it is interesting to rank the subclasses in order to know which subclass has the most different behavior with respect to a given rule. This ranking is done by analyzing the absolute value of z. In the voting example, |z_republican| ≈ 14.14 and |z_democrat| ≈ 14.16, showing that the democrats have the most different behavior among the subclasses. This is particularly useful when there are more than two subclasses.

Continuing the analysis of this rule, by analyzing only the entire dataset of foreign politics votes we could end up assuming that almost half of the congressmen would support the rule independently of their political affiliation, but it is clear that this rule is endorsed only by the democrats. Of course, this information could be retrieved from the entire dataset if the political affiliation were present in each transaction, which could result in a rule of the form aid-to-nicaraguan-contras=yes democrat=yes → el-salvador-aid=no, but the dataset can be divided according to any user criterion, not only the presence or absence of a given item. Finally, the stimulus does not provoke any other response in either subclass for support = 20% and confidence = 80% (Problem 2).

Another rule is el-salvador-aid=yes immigration=yes → duty-free-exports=no, with support = 20.2% and confidence = 82.2%. This rule appears in the republican subclass with support = 42.9% and confidence = 84.7% and in the democrat subclass with support = 6.0% and confidence = 72.7%. Using formula (2) to test the support values (Problem 1):

z_republican = (0.429 − 0.202) / sqrt( (0.202 × 0.798 / 168) × (435 − 168)/(435 − 1) ) ≈ +9.34

z_democrat = (0.060 − 0.202) / sqrt( (0.202 × 0.798 / 267) × (435 − 267)/(435 − 1) ) ≈ −9.29

which allows the rejection of the null hypothesis in both cases, with the republicans having the most different behavior. Testing for the same stimulus having different responses, we find in the republican subclass the rule el-salvador-aid=yes immigration=yes → aid-to-nicaraguan-contras=no with support = 42.9% and confidence = 84.7%. This rule holds for the republican subclass but does not hold for the entire dataset, showing that the same stimulus provoked a different response dependent on the republican subclass.

5 Conclusions

This paper introduced methods for mining association rules that make explicit the differences in behavior between a whole dataset and its subclasses. These methods are able to efficiently discover when a given association rule is class-dependent and which classes have the strongest influence on each rule. They also show when a given stimulus can provoke in one class a response that would not be expected if only the undivided dataset were analyzed. This analysis can be especially useful in cases where a generalization/specialization hierarchy already exists, in which case it can reveal the same stimulus having different responses whose items occur only in the subclasses. Furthermore, the presented methods can be incorporated into current association rule mining algorithms. To the best of our knowledge, there are no current data mining algorithms implementing similar methods or able to discover the same information.

References

[1] Agrawal, R.; Imielinski, T.; Swami, A. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, Washington, D.C., USA, 1993.
[2] Agrawal, R.; Srikant, R. Fast algorithms for mining association rules. In Proc. of the Int'l Conf. on Very Large Databases, Santiago de Chile, Chile, 1994.
[3] Brin, S.; Motwani, R.; Silverstein, C. Beyond market baskets: generalizing association rules to correlations. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, Tucson, Arizona, USA, 1997.
[4] Brin, S.; Motwani, R.; Ullman, J.; Tsur, S. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, Tucson, Arizona, USA, 1997.
[5] Morimoto, Y.; Fukuda, T.; Matsuzawa, H.; Tokuyama, T.; Yoda, K. Algorithms for mining association rules for binary segmentation of huge categorical databases. In Proc. of the Int'l Conf. on Very Large Databases, New York City, NY, USA, 1998.
[6] Bayardo, R.; Agrawal, R. Mining the most interesting rules. In Proc. of the ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, San Diego, CA, USA, 1999.
[7] Omiecinski, E. Alternative interest measures for mining associations in databases. IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 1, January/February 2003.
[8] Hilderman, R.J.; Hamilton, H.J. Knowledge discovery and interestingness measures: a survey. Technical Report CS 99-04, Department of Computer Science, University of Regina, Canada, 1999.
[9] Hughes, A.; Grawoig, D. Statistics: A Foundation for Analysis. Addison-Wesley, 1971.
[10] Han, J.; Pei, J.; Yin, Y. Mining frequent patterns without candidate generation. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, Dallas, Texas, USA, 2000.
[11] Zaki, M.; Parthasarathy, S.; Ogihara, M.; Li, W. New algorithms for fast discovery of association rules. In Proc. of the ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, Newport Beach, CA, USA, 1997.
[12] Blake, C.L.; Merz, C.J. UCI Repository of Machine Learning Databases. University of California, Department of Information and Computer Science, 1998.


More information

.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. .. Cal Poly CSC 4: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Mining Association Rules Examples Course Enrollments Itemset. I = { CSC3, CSC3, CSC40, CSC40, CSC4, CSC44, CSC4, CSC44,

More information

From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu

From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu From statistics to data science BAE 815 (Fall 2017) Dr. Zifei Liu Zifeiliu@ksu.edu Why? How? What? How much? How many? Individual facts (quantities, characters, or symbols) The Data-Information-Knowledge-Wisdom

More information

CS5112: Algorithms and Data Structures for Applications

CS5112: Algorithms and Data Structures for Applications CS5112: Algorithms and Data Structures for Applications Lecture 19: Association rules Ramin Zabih Some content from: Wikipedia/Google image search; Harrington; J. Leskovec, A. Rajaraman, J. Ullman: Mining

More information

Finding the True Frequent Itemsets

Finding the True Frequent Itemsets Finding the True Frequent Itemsets Matteo Riondato Fabio Vandin Wednesday 9 th April, 2014 Abstract Frequent Itemsets (FIs) mining is a fundamental primitive in knowledge discovery. It requires to identify

More information

Classification Based on Logical Concept Analysis

Classification Based on Logical Concept Analysis Classification Based on Logical Concept Analysis Yan Zhao and Yiyu Yao Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2 E-mail: {yanzhao, yyao}@cs.uregina.ca Abstract.

More information

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit:

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit: Frequent Itemsets and Association Rule Mining Vinay Setty vinay.j.setty@uis.no Slides credit: http://www.mmds.org/ Association Rule Discovery Supermarket shelf management Market-basket model: Goal: Identify

More information

On Multi-Class Cost-Sensitive Learning

On Multi-Class Cost-Sensitive Learning On Multi-Class Cost-Sensitive Learning Zhi-Hua Zhou and Xu-Ying Liu National Laboratory for Novel Software Technology Nanjing University, Nanjing 210093, China {zhouzh, liuxy}@lamda.nju.edu.cn Abstract

More information

Finding All Minimal Infrequent Multi-dimensional Intervals

Finding All Minimal Infrequent Multi-dimensional Intervals Finding All Minimal nfrequent Multi-dimensional ntervals Khaled M. Elbassioni Max-Planck-nstitut für nformatik, Saarbrücken, Germany; elbassio@mpi-sb.mpg.de Abstract. Let D be a database of transactions

More information

10/19/2017 MIST.6060 Business Intelligence and Data Mining 1. Association Rules

10/19/2017 MIST.6060 Business Intelligence and Data Mining 1. Association Rules 10/19/2017 MIST6060 Business Intelligence and Data Mining 1 Examples of Association Rules Association Rules Sixty percent of customers who buy sheets and pillowcases order a comforter next, followed by

More information

InfoMiner: Mining Surprising Periodic Patterns

InfoMiner: Mining Surprising Periodic Patterns InfoMiner: Mining Surprising Periodic Patterns Jiong Yang IBM Watson Research Center jiyang@us.ibm.com Wei Wang IBM Watson Research Center ww1@us.ibm.com Philip S. Yu IBM Watson Research Center psyu@us.ibm.com

More information

Unsupervised Learning. k-means Algorithm

Unsupervised Learning. k-means Algorithm Unsupervised Learning Supervised Learning: Learn to predict y from x from examples of (x, y). Performance is measured by error rate. Unsupervised Learning: Learn a representation from exs. of x. Learn

More information

NetBox: A Probabilistic Method for Analyzing Market Basket Data

NetBox: A Probabilistic Method for Analyzing Market Basket Data NetBox: A Probabilistic Method for Analyzing Market Basket Data José Miguel Hernández-Lobato joint work with Zoubin Gharhamani Department of Engineering, Cambridge University October 22, 2012 J. M. Hernández-Lobato

More information

CHAPTER 2: DATA MINING - A MODERN TOOL FOR ANALYSIS. Due to elements of uncertainty many problems in this world appear to be

CHAPTER 2: DATA MINING - A MODERN TOOL FOR ANALYSIS. Due to elements of uncertainty many problems in this world appear to be 11 CHAPTER 2: DATA MINING - A MODERN TOOL FOR ANALYSIS Due to elements of uncertainty many problems in this world appear to be complex. The uncertainty may be either in parameters defining the problem

More information

DATA MINING LECTURE 4. Frequent Itemsets and Association Rules

DATA MINING LECTURE 4. Frequent Itemsets and Association Rules DATA MINING LECTURE 4 Frequent Itemsets and Association Rules This is how it all started Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases.

More information