Apriori algorithm
Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK
Presentation 12.3.2008, Lauri Lahti

Association rules
Techniques for data mining and knowledge discovery in databases. Five important algorithms in the development of association rules (Yilmaz et al., 2003):
AIS algorithm, 1993
SETM algorithm, 1995
Apriori, AprioriTid and AprioriHybrid, 1994

Apriori algorithm
Developed by Agrawal and Srikant in 1994. An innovative way to find association rules at large scale, allowing implication outcomes that consist of more than one item. Based on a minimum support threshold (already used in the AIS algorithm). Three versions:
Apriori (basic version), faster in the first iterations
AprioriTid, faster in later iterations
AprioriHybrid, can change from Apriori to AprioriTid after the first iterations

Limitations of the Apriori algorithm
Needs several passes over the data. Uses a uniform minimum support threshold, which makes it difficult to find rarely occurring events. Alternative methods (other than Apriori) can address this by using a non-uniform minimum support threshold. Some competing alternative approaches focus on partitioning and sampling.

Phases of knowledge discovery
1) data selection
2) data cleansing
3) data enrichment (integration with additional resources)
4) data transformation or encoding
5) data mining
6) reporting and display (visualization) of the discovered knowledge
(Elmasri and Navathe, 2000)

Application of data mining
Data mining can typically be used with transactional databases (for ex. in shopping cart analysis). The aim can be to build association rules about the shopping events. These are based on itemsets, such as:
{milk, cocoa powder}, a 2-itemset
{milk, corn flakes, bread}, a 3-itemset

Association rules
Items that often occur together can be associated with each other. These co-occurring items form a frequent itemset. Conclusions based on the frequent itemsets form association rules. For ex. {milk, cocoa powder} can yield the rule cocoa powder → milk.

Sets of the database
Transactional database D. All products form an itemset I = {i1, i2, …, im}. A unique shopping event T ⊆ I. T contains an itemset X iff X ⊆ T. Based on itemsets X and Y, an association rule can be X → Y. It is required that X ⊂ I, Y ⊂ I and X ∩ Y = ∅.
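
As a minimal sketch of these definitions, Python's frozenset can model itemsets, events and the containment test X ⊆ T (the item names below are illustrative, not from the slides):

```python
# Illustrative item names (my own, not from the slides); any hashable values work.
I = frozenset(["milk", "cocoa powder", "bread", "salt"])  # all products
T = frozenset(["milk", "cocoa powder", "bread"])          # one shopping event, T ⊆ I
X = frozenset(["milk", "cocoa powder"])                   # a candidate itemset

print(X <= T)             # True: event T contains itemset X, i.e. X ⊆ T
print(X <= I and T <= I)  # True: both are subsets of the full itemset I
```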

Properties of rules
Types of item values: boolean, quantitative, categorical. Dimensions of rules:
1D: buys(cocoa powder) → buys(milk)
3D: age(X, "under 12") ∧ gender(X, "male") → buys(X, "comic book")
The latter is an example of a profile association rule. Intradimension rules, interdimension rules, hybrid-dimension rules (Han and Kamber, 2001). Concept hierarchies and multilevel association rules.

Quality of rules
The interestingness problem (Liu et al., 1999):
some generated rules can be self-evident
some marginal events can dominate
interesting events can be rarely occurring
There is a need to estimate how interesting the rules are. Subjective and objective measures.

Subjective measures
Often based on earlier user experiences and beliefs.
Unexpectedness: rules are interesting if they are unknown or contradict existing knowledge (or expectations).
Actionability: rules are interesting if users can gain an advantage by acting on them.
Weak and strong beliefs.

Objective measures
Based on threshold values controlled by the user. Some typical measures (Han and Kamber, 2001):
simplicity
support (utility)
confidence (certainty)

Simplicity
Focus on generating simple association rules. The length of a rule can be limited by a user-defined threshold. With smaller itemsets the interpretation of rules is more intuitive. Unfortunately this can increase the number of rules too much. Quantitative values can be quantized (for ex. into age groups).

Simplicity, example
One association rule that holds an association between cocoa powder and milk:
buys(cocoa powder) → buys(bread, milk, salt)
Simpler and more intuitive might be:
buys(cocoa powder) → buys(milk)

Support (utility)
The usefulness of a rule can be measured with a minimum support threshold. This parameter lets one measure how many events have itemsets that match both sides of the implication in the association rule. Rules whose itemsets do not match both sides sufficiently often (as defined by a threshold value) can be excluded.

Support (utility) (2)
Database D consists of events T1, T2, …, Tm, that is D = {T1, T2, …, Tm}. Let there be an itemset X that is a subset of event Tk, that is X ⊆ Tk. The support can be defined as
sup(X) = |{Tk ∈ D : X ⊆ Tk}| / |D|
This ratio compares the number of events containing itemset X to the number of all events in the database.

Support (utility), example
Let's assume D = {(1,2,3), (2,3,4), (1,2,4), (1,2,5), (1,3,5)}. The support for itemset (1,2) is
sup((1,2)) = |{Tk ∈ D : (1,2) ⊆ Tk}| / |D| = 3/5
That is: the ratio of the number of events containing itemset (1,2) to the number of all events in the database.
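
A minimal sketch of this computation in Python (the function name sup is mine, mirroring the slide's notation; Fraction keeps the result exact):

```python
from fractions import Fraction

# The example database from this slide.
D = [{1, 2, 3}, {2, 3, 4}, {1, 2, 4}, {1, 2, 5}, {1, 3, 5}]

def sup(X, D):
    """Support of itemset X: fraction of events in D that contain X."""
    X = set(X)
    return Fraction(sum(1 for T in D if X <= T), len(D))

print(sup({1, 2}, D))  # 3/5, as on the slide
```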

Confidence (certainty)
The certainty of a rule can be measured with a threshold for confidence. This parameter lets one measure how often an event's itemset that matches the left side of the implication in the association rule also matches the right side. Rules whose itemsets do not match the right side sufficiently often while matching the left (as defined by a threshold value) can be excluded.

Confidence (certainty) (2)
Database D consists of events T1, T2, …, Tm, that is: D = {T1, T2, …, Tm}. Let there be a rule Xa → Xb so that the itemsets Xa and Xb are subsets of event Tk, that is: Xa ⊆ Tk and Xb ⊆ Tk. Also let Xa ∩ Xb = ∅. The confidence can be defined as
conf(Xa → Xb) = sup(Xa ∪ Xb) / sup(Xa)
This ratio compares the number of events containing both itemsets Xa and Xb to the number of events containing itemset Xa.

Confidence (certainty), example
Let's assume D = {(1,2,3), (2,3,4), (1,2,4), (1,2,5), (1,3,5)}. The confidence for the rule 1 → 2 is
conf(1 → 2) = sup((1,2)) / sup((1)) = (3/5) / (4/5) = 3/4
That is: the ratio of the number of events containing both itemsets Xa and Xb to the number of events containing itemset Xa.
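
A matching sketch in Python, reusing sup and D from the support example above (the name conf is mine):

```python
def conf(Xa, Xb, D):
    """Confidence of the rule Xa -> Xb: sup(Xa ∪ Xb) / sup(Xa)."""
    return sup(set(Xa) | set(Xb), D) / sup(Xa, D)

print(conf({1}, {2}, D))  # 3/4, as on the slide
```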

Support and confidence
If confidence reaches a value of 100% the rule is an exact rule. Even if confidence reaches high values, the rule is not useful unless the support value is high as well. Rules that have both high confidence and high support are called strong rules. Some competing alternative approaches (other than Apriori) can generate useful rules even with low support values.

Generating association rules
Usually consists of two subproblems (Han and Kamber, 2001):
1) Finding frequent itemsets whose occurrences exceed a predefined minimum support threshold
2) Deriving association rules from those frequent itemsets (with the constraint of a minimum confidence threshold); see the sketch below
These two subproblems are solved iteratively until no new rules emerge. The second subproblem is quite straightforward and most of the research focus is on the first subproblem.
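
A hedged sketch of subproblem 2, reusing conf and D from the confidence example (the function name derive_rules and the enumeration strategy are my own, not from the slides):

```python
from itertools import combinations

def derive_rules(frequent_itemsets, D, min_conf):
    """For each frequent itemset F, try every split of F into a non-empty
    antecedent Xa and consequent Xb = F - Xa, keeping confident rules."""
    rules = []
    for F in map(set, frequent_itemsets):
        for r in range(1, len(F)):
            for Xa in map(set, combinations(F, r)):
                Xb = F - Xa
                if conf(Xa, Xb, D) >= min_conf:
                    rules.append((Xa, Xb))  # read as: Xa -> Xb
    return rules

print(derive_rules([{1, 2}], D, 0.7))  # both 1 -> 2 and 2 -> 1 have confidence 3/4 here
```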

Use of the Apriori algorithm
Initial information: a transactional database D and a user-defined numeric minimum support threshold min_sup. The algorithm uses knowledge from the previous iteration phase to produce frequent itemsets. This is reflected in the Latin origin of the name, which means "from what comes before".

Creating frequent sets
Let's define:
Ck as a candidate itemset of size k
Lk as a frequent itemset of size k
The main steps of an iteration are (see the sketch below):
1) Find the frequent set Lk-1
2) Join step: Ck is generated by joining Lk-1 with itself (cartesian product Lk-1 × Lk-1)
3) Prune step (apriori property): any (k-1)-size itemset that is not frequent cannot be a subset of a frequent k-size itemset, hence it should be removed
4) The frequent set Lk has been obtained
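
A minimal sketch of the join and prune steps (the name apriori_gen follows common usage in the literature; representing Lk-1 as a set of frozensets is my own choice):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Join step: union pairs of frequent (k-1)-itemsets whose union has size k.
    Prune step: drop candidates with an infrequent (k-1)-subset (apriori property)."""
    candidates = {a | b for a in L_prev for b in L_prev if len(a | b) == k}
    return {c for c in candidates
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

L1 = {frozenset([1]), frozenset([2]), frozenset([3])}
print(apriori_gen(L1, 2))  # candidate 2-itemsets built from the frequent 1-itemsets
```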

Creating frequent sets (2)
The algorithm uses breadth-first search and a hash tree structure to handle candidate itemsets efficiently. The occurrence frequency of each candidate itemset is counted. Those candidate itemsets whose frequency exceeds the minimum support threshold qualify as frequent itemsets.

Apriori algorithm in pseudocode
L1 = {frequent items};
for (k = 2; Lk-1 != ∅; k++) do begin
    Ck = candidates generated from Lk-1
         (that is: cartesian product Lk-1 × Lk-1, eliminating any (k-1)-size itemset that is not frequent);
    for each transaction t in database do
        increment the count of all candidates in Ck that are contained in t
    Lk = candidates in Ck with min_sup
end
return ∪k Lk;
(www.cs.sunysb.edu/~cse634/lecture_notes/07apriori.pdf)
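
A runnable sketch of the whole pseudocode in Python, reusing sup and apriori_gen from the earlier examples (min_sup here is a relative threshold in [0, 1]):

```python
def apriori(D, min_sup):
    """Return all frequent itemsets of D, following the pseudocode above."""
    D = [frozenset(t) for t in D]
    items = {frozenset([i]) for t in D for i in t}
    L = {c for c in items if sup(c, D) >= min_sup}    # L1 = {frequent items}
    result = set(L)
    k = 2
    while L:                                          # while Lk-1 is not empty
        C = apriori_gen(L, k)                         # join + prune
        L = {c for c in C if sup(c, D) >= min_sup}    # count and keep frequent
        result |= L
        k += 1
    return result

# On the slide's example database, with min_sup = 3/5:
print(sorted(map(sorted, apriori([{1,2,3}, {2,3,4}, {1,2,4}, {1,2,5}, {1,3,5}], 0.6))))
# [[1], [1, 2], [2], [3]]
```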

Apriori algorithm in pseudocode (2)
Join step and prune step (Wikipedia; figure not transcribed)