Frequent Itemset Mining


1 Frequent Itemset Mining. Nadjib LAZAAR, LIRMM-UM, COCONUT Team. IMAGINA 16/17. Webpage: http://www.lirmm.fr/~lazaar/teaching.html Email: lazaar@lirmm.fr

2 Data Mining. Data Mining (DM), or Knowledge Discovery in Databases (KDD), revolves around the investigation and creation of knowledge, processes, algorithms, and mechanisms for retrieving potential knowledge from data collections.

3 Game Data Mining.
- Data about player behavior, server performance, system functionality.
- How to convert these data into something meaningful? How to move from raw data to actionable insights?
- Game data mining is the answer: www.kdnuggets.com/2010/08/video-tutorial-christian-thurau-data-mining-in-games.html

4 Frequent Itemset Mining: Motivations.
Frequent Itemset Mining is a method for market basket analysis. It aims at finding regularities in the shopping behavior of customers of supermarkets, mail-order companies, on-line shops, etc. More specifically: find sets of products that are frequently bought together.
Possible applications of the found frequent itemsets:
- Improve the arrangement of products in shelves, on a catalog's pages, etc.
- Support cross-selling (suggestion of other products) and product bundling.
- Fraud detection, technical dependence analysis, fault localization, etc.
Often the found patterns are expressed as association rules, for example: if a customer buys bread and wine, then she/he will probably also buy cheese.

5 Frequent Itemset Mining: Basic notions.
- Items: I = {i_1, ..., i_n}
- Itemset: P ⊆ I
- Transactions: T = {t_1, ..., t_m}
- Language of itemsets: L_I = 2^I
- Transactional dataset: D ⊆ I × T
- Cover of an itemset: S(P) = {t_i ∈ T | P ⊆ t_i}
- (Absolute) frequency: freq(P) = |S(P)|

6 Absolute/relative frequency.
- Absolute frequency: freq(P) = |S(P)|
- Relative frequency: freq(P) = (1/m) |S(P)|
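These notions translate directly into code. Below is a minimal Python sketch of the cover and of the absolute/relative frequency of an itemset; the five-transaction dataset is made up for illustration (it is not the ten-transaction dataset used in the examples that follow).

```python
# Toy transactional dataset (hypothetical, for illustration only):
# transaction id -> set of items bought together.
transactions = {
    1: {"a", "d", "e"},
    2: {"b", "c", "d"},
    3: {"a", "c", "e"},
    4: {"a", "b", "c", "d", "e"},
    5: {"b", "d"},
}

def cover(P, transactions):
    """S(P): the ids of the transactions containing every item of P."""
    return {tid for tid, t in transactions.items() if set(P) <= t}

def freq(P, transactions):
    """Absolute frequency: |S(P)|."""
    return len(cover(P, transactions))

def rel_freq(P, transactions):
    """Relative frequency: |S(P)| / m, with m the number of transactions."""
    return freq(P, transactions) / len(transactions)
```

For instance, cover({"a", "d"}, transactions) is {1, 4}, so the absolute frequency of ad is 2 and its relative frequency 2/5.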

7 Frequent Itemset Mining: Definition.
Given:
- A set of items I = {i_1, ..., i_n}
- A set of transactions over the items T = {t_1, ..., t_m}
- A minimum support θ
Find: the set of itemsets P such that freq(P) ≥ θ.

8 Example (1). I = {a, b, c, d, e}, T = {1, ..., 10}. [table of the 10 example transactions, not captured in the transcription] S(bc) = {2, 7, 9}, freq(bc) = 3. Q: is it an absolute or a relative frequency?

9 Example (1). [Hasse diagram of the itemset lattice over {a, b, c, d, e}, rooted at ∅]

10 Example (1). [the same lattice, annotated with the support count of every itemset]

11 Example (1). Frequent itemset? [the lattice with support counts]

12 Example (1). Frequent itemsets with minimum support θ = 3? [the lattice restricted to itemsets with non-zero support, highlighting those with support ≥ 3]

13 Searching for Frequent Itemsets. A naive search that consists of enumerating and testing the frequency of itemset candidates in a given dataset is usually infeasible. Why?

  Number of items (n)   Search space (2^n)
  10                    ~ 10^3
  20                    ~ 10^6
  30                    ~ 10^9
  100                   ~ 10^30
  128                   ~ 10^38
  1000                  ~ 10^301

(For comparison, estimates put the number of atoms in the observable universe around 10^80.)
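For very small n the naive search is still doable, which makes it a useful baseline. The sketch below (Python; the five-transaction dataset is made up for illustration) enumerates all 2^n - 1 non-empty candidate itemsets and keeps the frequent ones; the table above shows why this cannot scale.

```python
from itertools import combinations

items = "abcde"
transactions = [{"a", "d", "e"}, {"b", "c", "d"}, {"a", "c", "e"},
                {"a", "b", "c", "d", "e"}, {"b", "d"}]

def freq(P):
    return sum(1 for t in transactions if set(P) <= t)

def naive_frequent(theta):
    """Enumerate and test all 2^n - 1 non-empty itemsets: O(2^n) candidates."""
    result = {}
    for k in range(1, len(items) + 1):
        for P in combinations(items, k):   # all itemsets of size k
            if freq(P) >= theta:
                result[P] = freq(P)
    return result

frequent = naive_frequent(theta=2)   # 31 candidates tested here (n = 5)
```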

14 Anti-monotonicity property. Given a transaction database D over items I and two itemsets X, Y:
  X ⊆ Y ⇒ S(Y) ⊆ S(X)
That is:
  X ⊆ Y ⇒ freq(Y) ≤ freq(X)
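The property can be checked exhaustively on a small instance. This Python sketch (toy dataset made up for illustration) verifies, for every pair X ⊂ Y over five items, that S(Y) ⊆ S(X) and hence freq(Y) ≤ freq(X).

```python
from itertools import combinations

transactions = [{"a", "d", "e"}, {"b", "c", "d"}, {"a", "c", "e"},
                {"a", "b", "c", "d", "e"}, {"b", "d"}]

def cover(P):
    return {i for i, t in enumerate(transactions) if set(P) <= t}

checked = 0
for k in range(1, 5):
    for Y in combinations("abcde", k + 1):
        for X in combinations(Y, k):               # every k-subset X of Y
            assert cover(Y) <= cover(X)            # S(Y) ⊆ S(X) ...
            assert len(cover(Y)) <= len(cover(X))  # ... so freq(Y) <= freq(X)
            checked += 1
```

Checking only the immediate subsets suffices: the property then follows for all X ⊆ Y by transitivity.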

15 Example (2). S(ade) = {1, 4, 8, 10}, freq(ade) = 4. S(acde) = {4, 8}, freq(acde) = 2. [the lattice with support counts, highlighting ade and acde]

16 Apriori property. Given a transaction database D over items I, a minimum support θ, and two itemsets X, Y:
  X ⊆ Y ⇒ freq(Y) ≤ freq(X)
It follows:
  X ⊆ Y ⇒ (freq(Y) ≥ θ ⇒ freq(X) ≥ θ)
All subsets of a frequent itemset are frequent!
Contraposition:
  X ⊆ Y ⇒ (freq(X) < θ ⇒ freq(Y) < θ)
All supersets of an infrequent itemset are infrequent!

17 Example (3). All subsets of a frequent itemset are frequent! With θ = 2: [the lattice highlighting acde (support 2) and all of its subsets a, c, d, e, ac, ad, ae, cd, ce, de, acd, ace, ade, cde]

18 Example (3). All supersets of an infrequent itemset are infrequent! With θ = 2: [the lattice highlighting be and all of its supersets abe, bce, bde, abce, abde, bcde, abcde]

19 Partially ordered sets. A partial order is a binary relation R over a set S such that, for all x, y, z ∈ S:
- x R x (reflexivity)
- x R y ∧ y R x ⇒ x = y (anti-symmetry)
- x R y ∧ y R z ⇒ x R z (transitivity)
[the itemset lattice, rooted at ∅] Q: here, S = ? R = ?

20 Poset (2^I, ⊆).
- Comparable itemsets: x ⊆ y ∨ y ⊆ x
- Incomparable itemsets: x ⊈ y ∧ y ⊈ x
[the itemset lattice, rooted at ∅]

21 Apriori Algorithm [Agrawal and Srikant 1994].
- Determine the support of the one-element itemsets (i.e., singletons) and discard the infrequent items.
- Form candidate itemsets with two items (both items must be frequent), determine their support, and discard the infrequent itemsets.
- Form candidate itemsets with three items (all contained pairs must be frequent), determine their support, and discard the infrequent itemsets.
- And so on!
Based on candidate generation and pruning.

22 Apriori Algorithm [Agrawal and Srikant 1994]. [full pseudocode of the algorithm, not captured in the transcription]

23 Apriori candidate generation.
apriori_gen(L_k):
  E ← ∅
  for all P', P'' ∈ L_k s.t. P' = {i_1, ..., i_{k-1}, i_k} and P'' = {i_1, ..., i_{k-1}, i'_k}:
    P ← P' ∪ P''    // {i_1, ..., i_{k-1}, i_k, i'_k}
    if ∀i ∈ P : P \ {i} ∈ L_k then E ← E ∪ {P}
  return E
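Here is a runnable Python sketch of apriori_gen plus the level-wise loop around it. Itemsets are kept as sorted tuples so that the prefix join is exactly the construction above; the five-transaction dataset is made up, and the code is an illustrative reconstruction, not the authors' implementation.

```python
from itertools import combinations

def apriori_gen(Lk):
    """Join two frequent k-itemsets sharing their first k-1 items, then
    prune every candidate with an infrequent k-subset (Apriori property)."""
    Lk = set(Lk)
    E = set()
    for P1 in Lk:
        for P2 in Lk:
            if P1[:-1] == P2[:-1] and P1[-1] < P2[-1]:
                P = P1 + (P2[-1],)   # {i_1, ..., i_{k-1}, i_k, i'_k}
                if all(s in Lk for s in combinations(P, len(P) - 1)):
                    E.add(P)
    return E

def apriori(transactions, items, theta):
    """Level-wise search: candidates of size k+1 from frequent k-itemsets."""
    def freq(P):
        return sum(1 for t in transactions if set(P) <= t)
    level = {(i,) for i in items if freq((i,)) >= theta}
    frequent = {P: freq(P) for P in level}
    while level:
        level = {P for P in apriori_gen(level) if freq(P) >= theta}
        frequent.update({P: freq(P) for P in level})
    return frequent

# Toy dataset (made up); items must be given in a fixed (lexicographic) order.
transactions = [{"a", "d", "e"}, {"b", "c", "d"}, {"a", "c", "e"},
                {"a", "b", "c", "d", "e"}, {"b", "d"}]
frequent = apriori(transactions, "abcde", theta=2)
```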

24 Improving candidate generation.
- Using the apriori-gen function, an itemset of size k+1 can be generated in j possible ways: j = k(k+1)/2 (one per pair of its k-subsets).
- The need: generate each candidate itemset at most once.
- How: assign to each itemset a unique parent itemset, from which this itemset is to be generated.

25 Improving candidate generation. Assigning unique parents turns the poset lattice into a tree: [figure: the itemset lattice redrawn as a tree]

26 Canonical form for itemsets.
- An itemset can be represented as a word over the alphabet I.
- Q: how many words can represent an itemset of 3 items? Of 4 items? Of k items? k!
- An arbitrary order on the items (e.g., the lexicographic order) gives a canonical form: a unique representation of itemsets, which breaks these symmetries.
- Lexicographic order on items: abc < acb < bac < bca < ...

27 Recursive processing with canonical forms.
- For each P of a given level, generate all possible extensions of P by one item such that:
  child(P) = {P' : (i ∉ P) ∧ (P' = P ∪ {i}) ∧ (c(P).last < i) ∧ (P' is frequent)}
- Then process each child P' recursively.
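The rule above can be sketched as a short depth-first search in Python: each itemset, kept as a sorted tuple, is only extended with items greater than its last item, so every frequent itemset has a unique parent and is generated exactly once (the five-transaction dataset is made up for illustration).

```python
transactions = [{"a", "d", "e"}, {"b", "c", "d"}, {"a", "c", "e"},
                {"a", "b", "c", "d", "e"}, {"b", "d"}]
items = "abcde"
theta = 2

def freq(P):
    return sum(1 for t in transactions if set(P) <= t)

def expand(P, frequent):
    """Generate child(P): extensions by an item i greater than c(P).last."""
    for i in items:
        if i > P[-1]:                      # canonical extensions only
            child = P + (i,)
            f = freq(child)
            if f >= theta:                 # Apriori pruning on the fly
                frequent[child] = f
                expand(child, frequent)    # process each child recursively

frequent = {}
for i in items:                            # start from the frequent singletons
    f = freq((i,))
    if f >= theta:
        frequent[(i,)] = f
        expand((i,), frequent)
```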

28 Example (4). Q: what are the children of the highlighted itemsets? [the lattice with support counts]
child(P) = {P' : (i ∉ P) ∧ (P' = P ∪ {i}) ∧ (c(P).last < i) ∧ (P' is frequent)}

29 Items Ordering.
- Any order can be used; that is, the order is arbitrary.
- The search space differs considerably depending on the order.
- Thus, the efficiency of Frequent Itemset Mining algorithms can differ considerably depending on the item order.
- Advanced methods even adapt the order of the items during the search: they use different, but compatible, orders in different branches.

30 Items Ordering (heuristics).
- Frequent itemsets consist of frequent items: sort the items w.r.t. their frequency (decreasing/increasing).
- The sum of the sizes of the transactions containing a given item implicitly captures the frequency of pairs, triplets, etc.: sort the items w.r.t. the sum of the sizes of the transactions that cover them.
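Both heuristics are one-liners over the dataset. The sketch below (Python; toy dataset made up for illustration) computes the two orders: by item frequency, and by the summed sizes of the covering transactions.

```python
from collections import Counter

transactions = [{"a", "d", "e"}, {"b", "c", "d"}, {"a", "c", "e"},
                {"a", "b", "c", "d", "e"}, {"b", "d"}]

# Heuristic 1: order items by their frequency (here: increasing).
item_freq = Counter(i for t in transactions for i in t)
by_frequency = sorted(item_freq, key=item_freq.get)

# Heuristic 2: order items by the summed sizes of the transactions that
# cover them (implicitly accounts for pairs, triplets, ...).
tx_size_sum = {i: sum(len(t) for t in transactions if i in t)
               for i in item_freq}
by_cover_sizes = sorted(tx_size_sum, key=tx_size_sum.get)
```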

31 Tutorials link: http://www.lirmm.fr/~lazaar/imagina/td1.pdf

32 Condensed representation of Frequent Itemsets: Closed and Maximal Itemsets

33 Maximal Itemsets.
- The set of maximal (frequent) itemsets:
  M_θ = {P ∈ 2^I | freq(P) ≥ θ ∧ ∀P' ⊃ P : freq(P') < θ}
- That is: an itemset is maximal if it is frequent, but none of its proper supersets is frequent.
  ∀θ, ∀P ∈ F_θ : (P ∈ M_θ) ∨ (∃P' ⊃ P : freq(P') ≥ θ)

34 Maximal Itemsets.
- Every frequent itemset has a maximal superset:
  ∀θ, ∀P ∈ F_θ : ∃P' ∈ M_θ : P ⊆ P'
- The maximal itemsets are a condensed representation of the frequent itemsets, where:
  ∀θ : F_θ = ⋃_{P ∈ M_θ} 2^P
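Given the set of frequent itemsets, the maximal ones are simply those with no frequent proper superset. A minimal Python sketch, with the frequent itemsets computed naively on a made-up five-transaction dataset:

```python
from itertools import combinations

transactions = [{"a", "d", "e"}, {"b", "c", "d"}, {"a", "c", "e"},
                {"a", "b", "c", "d", "e"}, {"b", "d"}]
theta = 2

def freq(P):
    return sum(1 for t in transactions if P <= t)

# All frequent itemsets, as frozensets, computed naively for this tiny example.
frequent = {frozenset(P): freq(frozenset(P))
            for k in range(1, 6)
            for P in combinations("abcde", k)
            if freq(set(P)) >= theta}

def maximal(frequent):
    """M: frequent itemsets with no frequent proper superset."""
    return {P for P in frequent if not any(P < Q for Q in frequent)}

M = maximal(frequent)   # 16 frequent itemsets condensed into 3 maximal ones
```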

35 Example (5). Here are the frequent itemsets with minsup θ = 3. Q: what are the maximal itemsets for minsup θ = 3? [lattice of the frequent itemsets with their supports]
M_θ = {P ∈ 2^I | freq(P) ≥ θ ∧ ∀P' ⊃ P : freq(P') < θ}

36 Maximal Itemsets (limitation). The set of maximal itemsets captures the set of all frequent itemsets, BUT it does not preserve the knowledge of all support values. THE NEED: can we have a condensed representation of the set of frequent itemsets which preserves the knowledge of all support values?

37 Closed Itemsets.
- The set of closed (frequent) itemsets:
  C_θ = {P ∈ 2^I | freq(P) ≥ θ ∧ ∀P' ⊃ P : freq(P') < freq(P)}
- An itemset is closed if it is frequent, but none of its proper supersets has the same support.
- That is:
  ∀θ, ∀P ∈ F_θ : (P ∈ C_θ) ∨ (∃P' ⊃ P : freq(P') = freq(P))

38 Closed Itemsets.
- Every frequent itemset has a closed superset:
  ∀θ, ∀P ∈ F_θ : ∃P' ∈ C_θ : P ⊆ P'
- The closed itemsets are a condensed representation of the frequent itemsets, where:
  ∀θ : F_θ = ⋃_{P ∈ C_θ} 2^P

39 Closed Itemsets.
- Every frequent itemset has a closed superset with the same support.
- The set of all closed itemsets preserves the knowledge of all support values:
  ∀θ, ∀P ∈ F_θ : |S(P)| = max_{P' ∈ C_θ, P' ⊇ P} |S(P')|
- This is not the case with the maximal itemsets:
  ∀θ, ∀P ∈ F_θ : |S(P)| ≥ max_{P' ∈ M_θ, P' ⊇ P} |S(P')|
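The same kind of toy computation works for closed itemsets, together with the support-recovery formula above (Python; dataset made up for illustration, frequent itemsets computed naively):

```python
from itertools import combinations

transactions = [{"a", "d", "e"}, {"b", "c", "d"}, {"a", "c", "e"},
                {"a", "b", "c", "d", "e"}, {"b", "d"}]
theta = 2

def freq(P):
    return sum(1 for t in transactions if P <= t)

# All frequent itemsets, as frozensets, computed naively for this tiny example.
frequent = {frozenset(P): freq(frozenset(P))
            for k in range(1, 6)
            for P in combinations("abcde", k)
            if freq(set(P)) >= theta}

def closed(frequent):
    """C: frequent itemsets with no proper superset of the same support."""
    return {P for P in frequent
            if not any(P < Q and frequent[Q] == frequent[P] for Q in frequent)}

C = closed(frequent)

def support(P, C, frequent):
    """Recover |S(P)| of any frequent P: max support over its closed supersets."""
    return max(frequent[Q] for Q in C if P <= Q)
```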

40 Example (6). Here are the frequent itemsets with minsup θ = 3. Q: are b and de closed itemsets? [lattice of the frequent itemsets with their supports, highlighting b and de]
C_θ = {P ∈ 2^I | freq(P) ≥ θ ∧ ∀P' ⊃ P : freq(P') < freq(P)}

41 Maximal/Closed itemsets. [diagram: the maximal itemsets form a subset of the closed itemsets] Q: can the closed itemsets be expressed using the maximal ones?
  C_θ = ⋃_{i ∈ {θ, ..., m}} M_i

42 Frequent/Closed/Maximal.

  Dataset     #Frequent      #Closed     #Maximal
  Zoo-1       151 807        3 292       230
  Mushroom    155 734        3 287       453
  Lymph       9 967 402      46 802      5 191
  Hepatitis   > 27 × 10^7    1 827 264   189 205

https://dtai.cs.kuleuven.be/cp4im/datasets/

43 Tutorials link: http://www.lirmm.fr/~lazaar/imagina/td2.pdf