
Mining Infrequent Patterns

JOHAN BJARNLE (JOHBJ551), PETER ZHU (PETZH912)
LINKÖPING UNIVERSITY, 2009
TNM033 DATA MINING

Contents

1 Introduction
2 Techniques
2.1 Negative Patterns
2.2 Negatively Correlated Patterns
2.3 Mining Negative Patterns
2.4 Mining Negatively Correlated Patterns
2.4.1 Concept Hierarchy
2.4.2 Indirect Association
3 Applications
4 Conclusion
References

1 Introduction

Often when considering data mining, the focus is on frequent patterns. Although the majority of the most interesting patterns lie among the frequent ones, this approach ignores some important patterns: the infrequent patterns. Take for example the sale of VHS tapes and DVDs. Few people buy both of them, so in data mining terms the itemset {VHS, DVD} is infrequent and therefore ignored. However, people who buy DVDs tend not to buy VHS tapes and vice versa: the two items compete, and that is an interesting pattern.

Data mining comes with many concepts and terms. To follow the concepts and definitions in this report, one should be familiar with two basic measures, support and confidence. Support measures how frequent an itemset is among all records:

SUPPORT: $s(X) = \frac{\sigma(X)}{N}$

where $\sigma(X)$ is the number of records containing itemset X and N is the total number of records. Confidence is the conditional probability $P(X \mid Y)$, which measures how often itemset X appears in records that contain itemset Y:

CONFIDENCE: $c(Y \rightarrow X) = \frac{\sigma(X \cup Y)}{\sigma(Y)}$

where $\sigma(X \cup Y)$ and $\sigma(Y)$ are the numbers of records containing the itemsets $X \cup Y$ and $Y$, respectively.
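To make the two measures concrete, here is a minimal Python sketch; the transaction data and item names are illustrative, not taken from the report:

    def support(records, itemset):
        """s(X) = sigma(X) / N: the fraction of records containing every item in X."""
        itemset = set(itemset)
        return sum(itemset <= set(r) for r in records) / len(records)

    def confidence(records, Y, X):
        """c(Y -> X) = sigma(X u Y) / sigma(Y): how often X appears in records with Y."""
        sigma_xy = sum((set(X) | set(Y)) <= set(r) for r in records)
        sigma_y = sum(set(Y) <= set(r) for r in records)
        return sigma_xy / sigma_y if sigma_y else 0.0

    records = [{"coffee", "tea"}, {"coffee"}, {"coffee", "milk"}, {"tea"}]
    print(support(records, {"coffee"}))              # 3 of 4 records -> 0.75
    print(confidence(records, {"coffee"}, {"tea"}))  # 1 of 3 coffee records has tea -> 0.33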

Infrequent patterns are itemsets or rules that do not pass the minsup criterion. The minsup threshold is set manually and separates frequent patterns from infrequent ones. If minsup is set too high, itemsets involving interesting rare items may be missed. For example, in a shopping mall, itemsets containing expensive colognes might be ignored, causing potentially interesting patterns to be missed.

DEFINITION 1: An infrequent pattern is an itemset or a rule whose support is less than the minsup threshold.

Most infrequent patterns are uninteresting, and they account for a very large number of itemsets, so algorithms are needed that filter out the interesting patterns while remaining computationally efficient. The definitions in this report were extracted from Introduction to Data Mining [1].

2 Techniques

2.1 Negative Patterns

Given an itemset $X = \{x_1, x_2, \ldots, x_k\}$, $\bar{x}_j$ denotes the absence of item $x_j$. For example, $\overline{DVD}$ corresponds to the absence of DVD. With the negative item defined, a negative itemset can be described as an itemset with both positive and negative items that passes the minsup threshold.

DEFINITION 2: A negative itemset X consists of positive items A and negative items $\bar{B}$, $X = A \cup \bar{B}$, where $|\bar{B}| \geq 1$ and $s(X) \geq minsup$.

From these negative itemsets, negative association rules can be defined.

DEFINITION 3: A negative association rule has the following three properties: the rule is extracted from a negative itemset, the support of the rule is greater than or equal to minsup, and the confidence of the rule is greater than or equal to minconf. Such a rule is written e.g. $coffee \rightarrow \overline{tea}$, meaning that people who buy coffee tend not to buy tea.

Negative itemsets and negative association rules will jointly be referred to as negative patterns in this report.
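As a small illustration of Definitions 2 and 3 (reusing support() and records from the sketch in the Introduction; the thresholds are illustrative), negative items can be modelled explicitly and a candidate rule checked against minsup and minconf:

    def neg_support(records, positive, negative):
        """Support of A u B-bar: records containing all of A and none of B."""
        hits = sum(set(positive) <= set(r) and not (set(negative) & set(r))
                   for r in records)
        return hits / len(records)

    minsup, minconf = 0.4, 0.6
    # Candidate negative rule coffee -> tea-bar, extracted from {coffee, tea-bar}.
    s = neg_support(records, {"coffee"}, {"tea"})   # 2 of 4 records -> 0.5
    c = s / support(records, {"coffee"})            # 0.5 / 0.75 -> 0.67
    print(s >= minsup and c >= minconf)             # True: the rule qualifies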

2.2 Negatively Correlated Patterns

The basic idea of a negatively correlated pattern is that its actual support should be less than its expected support, as stated in the definitions below.

DEFINITION 4: An itemset $X = \{x_1, x_2, \ldots, x_k\}$ is negatively correlated if

$s(X) < \prod_{j=1}^{k} s(x_j) = s(x_1) \times s(x_2) \times \cdots \times s(x_k)$,

where $s(x_j)$ is the support of item $x_j$.

The right-hand side, the expected support, is an estimate of the probability of X under the assumption that all items in X are statistically independent. The smaller the actual support is compared to the expected support, the more negatively correlated the pattern is.

DEFINITION 5: An association rule $X \rightarrow Y$ is negatively correlated if

$s(X \cup Y) < s(X)\,s(Y)$,

where X and Y are disjoint itemsets, i.e. $X \cap Y = \emptyset$.

This definition, however, provides only a partial condition for negative correlation between the items in X and the items in Y. The full condition is

$s(X \cup Y) < \prod_i s(x_i) \prod_j s(y_j)$,

where $x_i \in X$ and $y_j \in Y$.

Negatively correlated itemsets and association rules will be referred to as negatively correlated patterns. The pattern classes are closely related: infrequent patterns and negatively correlated patterns refer to itemsets that contain only positive items, while negative patterns may also contain negative items. Their relationship is shown in the figure below.

Figure 1. Comparison of infrequent patterns, negative patterns, and negatively correlated patterns.
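Definition 4 translates directly into code; a minimal sketch, again reusing support() and records from the Introduction (data illustrative):

    from math import prod  # Python 3.8+

    def is_negatively_correlated(records, itemset):
        """Definition 4: actual support below the independence-based expected support."""
        expected = prod(support(records, {x}) for x in itemset)
        return support(records, itemset) < expected

    # s({coffee, tea}) = 0.25 < s(coffee) * s(tea) = 0.75 * 0.5 = 0.375
    print(is_negatively_correlated(records, {"coffee", "tea"}))  # True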

2.3 Mining Negative Patterns

This technique uses a binary table in which every record is translated into binary values: each record is augmented with the negative items described earlier. The tables below demonstrate how this works.

RecordID  Items
1         {A, B}
2         {B, C}
3         {A}

Table 1. The original records.

RecordID  A  Ā  B  B̄  C  C̄
1         1  0  1  0  0  1
2         0  1  1  0  1  0
3         1  0  0  1  0  1

Table 2. The records translated to binary values.

The binarized table can then be combined with an algorithm such as Apriori [2] to derive all negative itemsets. This technique is only usable when the records cover few items, for the following reasons:

1. The number of items is doubled, since every item gets a corresponding negative item. The search space therefore grows far beyond the usual $2^d$ itemsets, where d is the number of items in the original data set.
2. Pruning based on support is no longer effective, since for every item either $x$ or $\bar{x}$ has support greater than or equal to 50%.
3. The width of each record increases when the corresponding negative items are added. If the total number of items is d, an original record is normally much smaller than d; in a food market, for instance, a record is almost guaranteed not to include all items. With this technique, however, every record has width d. Many existing algorithms cannot handle datasets of this size and will eventually break down.

To decrease the computational cost, the support of a negative itemset can be computed from the supports of its positive parts only:

EQUATION 1: $s(A \cup \bar{B}) = s(A) + \sum_{i=1}^{|B|} \left[ (-1)^i \times \sum_{B' \subseteq B,\, |B'| = i} s(A \cup B') \right]$
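Equation 1 is a straightforward inclusion–exclusion over the subsets of B; a sketch reusing support() and records from the Introduction (itemsets illustrative):

    from itertools import combinations

    def neg_itemset_support(records, A, B):
        """Equation 1: s(A u B-bar) computed from positive itemset supports only."""
        total = support(records, A)
        for i in range(1, len(B) + 1):
            for b_prime in combinations(sorted(B), i):
                total += (-1) ** i * support(records, set(A) | set(b_prime))
        return total

    # s({coffee, tea-bar}) = s(coffee) - s({coffee, tea}) = 0.75 - 0.25 = 0.5
    print(neg_itemset_support(records, {"coffee"}, {"tea"}))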

2.4 Mining Negatively Correlated Patterns

The second technique is based on the expected support: an infrequent pattern is considered interesting if its actual support is less than its expected support. Two approaches to calculating the expected support will be discussed. The first is based on a concept hierarchy and the second uses a neighborhood-based approach, indirect association.

2.4.1 Concept Hierarchy

The usual support calculation (see the Introduction) is not a sufficient measure for filtering away uninteresting infrequent patterns. Consider an itemset such as {TV, coke}, whose individual items are frequent. The itemset is infrequent due to low support and maybe negatively correlated, but it is uninteresting because the items belong to different categories. To a human expert this may be obvious, but for a computer to ignore these kinds of infrequent patterns, a good measure is needed.

Figure 2. Concept hierarchy.

This approach is based on the assumption that products within the same product family have similar types of interaction with other items. For example, if {Cookies, Pork} (see Figure 2) is a frequent itemset, then their children are expected to have similar relations. The expected support for {Oatmeal, Bacon}, where Oatmeal is a child of Cookies and Bacon is a child of Pork, can then be calculated as:

EXAMPLE 1: $\varepsilon(s(Oatmeal, Bacon)) = s(Cookies, Pork) \times \frac{s(Oatmeal)}{s(Cookies)} \times \frac{s(Bacon)}{s(Pork)}$

If the actual support for {Oatmeal, Bacon} is less than the expected support calculated above, we have found an interesting infrequent pattern. This first case, where only direct children of the items in the frequent itemset are considered ({Cookies, Pork} in Example 1), can be generalized into:

EQUATION 2: $\varepsilon(s(c_1, c_2, \ldots, c_k)) = s(x_1, x_2, \ldots, x_k) \times \frac{s(c_1)}{s(x_1)} \times \frac{s(c_2)}{s(x_2)} \times \cdots \times \frac{s(c_k)}{s(x_k)}$

where $\{x_1, x_2, \ldots, x_k\}$ is a frequent itemset and $x_j$ denotes the parent of $c_j$.

In the second case we keep all but one of the items in the frequent itemset and replace the remaining item by one of its children, for example {Cookies, Bacon}. The general equation for this is:

EQUATION 3: $\varepsilon(s(x_1, \ldots, x_{k-1}, c_k)) = s(x_1, \ldots, x_{k-1}, x_k) \times \frac{s(c_k)}{s(x_k)}$
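A sketch of Equations 2 and 3 under an assumed parent/child map; all supports below are made-up numbers mirroring the Cookies/Pork example, not values from the report:

    # Hypothetical single-item and itemset supports for the Figure 2 hierarchy.
    item_support = {"Cookies": 0.30, "Pork": 0.25, "Oatmeal": 0.10, "Bacon": 0.08}
    frequent_support = {frozenset({"Cookies", "Pork"}): 0.12}

    def expected_support(frequent, replacements):
        """Scale s(frequent) by s(child)/s(parent) for every item replaced by a
        child (Equation 2 replaces all items, Equation 3 only one of them)."""
        exp = frequent_support[frozenset(frequent)]
        for parent, child in replacements.items():
            exp *= item_support[child] / item_support[parent]
        return exp

    # Example 1 / Equation 2: both items replaced by their children.
    print(expected_support({"Cookies", "Pork"}, {"Cookies": "Oatmeal", "Pork": "Bacon"}))
    # Equation 3: only Pork replaced by its child Bacon -> {Cookies, Bacon}.
    print(expected_support({"Cookies", "Pork"}, {"Pork": "Bacon"}))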

The third case keeps all but one of the items in the frequent itemset and replaces the remaining item by one of its siblings, for example {Cookies, Chicken}. The general equation for this is:

EQUATION 4: $\varepsilon(s(x_1, \ldots, x_{k-1}, \hat{x}_k)) = s(x_1, \ldots, x_{k-1}, x_k) \times \frac{s(\hat{x}_k)}{s(x_k)}$

where $\hat{x}_k$ denotes a sibling of $x_k$. For simplicity the examples use a frequent itemset with only two items; when actually mining data the itemsets are generally larger. These three cases are the only interesting ones when mining for negative itemsets; itemsets consisting only of siblings, for example, are not considered.

2.4.2 Indirect Association

The previous section described how to calculate the expected support using a concept hierarchy. This technique instead uses indirect association, which relates two items through a set of items they have in common, the mediator set.

Figure 3. Indirect association between a and b through a mediator set Y.

DEFINITION 6: A pair of items a, b is indirectly associated through a mediator set Y if the following conditions hold:

1. $s(\{a, b\}) < t_s$ (itempair support condition).
2. There exists a set $Y \neq \emptyset$ such that:
   I. $s(\{a\} \cup Y) \geq t_f$ and $s(\{b\} \cup Y) \geq t_f$ (mediator support condition).
   II. $d(\{a\}, Y) \geq t_d$ and $d(\{b\}, Y) \geq t_d$, where $d(X, Z)$ is an objective measure of the association between X and Z (mediator dependence condition).
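Definition 6 leaves the objective measure $d(X, Z)$ abstract; the sketch below assumes lift (interest factor) for d and uses illustrative thresholds $t_s$, $t_f$, $t_d$, reusing support() from the Introduction:

    def lift(records, X, Z):
        """Assumed objective measure d(X, Z): the lift of X and Z."""
        return (support(records, set(X) | set(Z))
                / (support(records, X) * support(records, Z)))

    def indirectly_associated(records, a, b, Y, t_s=0.1, t_f=0.3, t_d=1.2):
        """Definition 6: are a and b indirectly associated through mediator set Y?"""
        if support(records, {a, b}) >= t_s:              # 1. itempair support condition
            return False
        if min(support(records, {a} | Y), support(records, {b} | Y)) < t_f:
            return False                                 # 2.I mediator support condition
        return min(lift(records, {a}, Y),
                   lift(records, {b}, Y)) >= t_d         # 2.II mediator dependence

With real basket data, a and b would be the competing items (e.g. a desktop computer and a laptop) and Y a set of accessories that both are frequently bought with.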

3 Applications

Infrequent patterns can be used in many applications. In text mining, indirect associations can be used to find synonyms, antonyms, or words used in different contexts. For example, the word "data" might be indirectly associated with the word "gold" through the mediator "mining". In the market basket domain, indirect associations can be used to find competing items, such as desktop computers and laptops: people who buy desktop computers tend not to buy laptops. Infrequent patterns can also be used to detect errors. For example, if {Fire = Yes} is frequent but {Fire = Yes, Alarm = On} is infrequent, then the alarm system is probably faulty. When evaluating Weka [3], we could not find any signs of an infrequent pattern miner.

4 Conclusion

The techniques discussed in this report show that we can find interesting infrequent patterns that would not be discovered with the usual algorithms. One might believe that these techniques are computationally heavy, but that is not the case. Savasere et al. [4] ran the negative correlation algorithm on a SPARCstation 5, a machine released almost fifteen years ago [5]. Running the algorithm over 50,000 records with an average of ten items per record took between 300 and 1,000 seconds; on a modern computer the same operations would probably finish within seconds. It is worth noting that negative and negatively correlated patterns do not have to be infrequent, see Figure 1. It is a shame that Weka does not contain any of the discussed algorithms for finding infrequent patterns. However, Apriori is implemented, and since the binary-table technique uses Apriori to find negative patterns, it should be possible to extend the existing Apriori implementation in Weka.

References

[1] Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, Addison-Wesley, pages 457-472, 2006.
[2] Aida Vitoria, Lecture 7, PowerPoint slides, Linköping University, 2009.
[3] Weka, http://www.cs.waikato.ac.nz/ml/weka/, 2009.
[4] Ashok Savasere, Edward Omiecinski, Shamkant Navathe, Mining for Strong Negative Associations in a Large Database of Customer Transactions. In Proc. of the 14th Intl. Conf. on Data Engineering, pages 494-502, 1998.
[5] Wikipedia, http://en.wikipedia.org/wiki/SPARCstation_5, 2009.