A new algorithm for generating frequent closed itemsets

1 A new algorithm for generating frequent closed itemsets
Huaiguo Fu
CRIL-CNRS FRE2499, Université d'Artois - IUT de Lens
Rue de l'université SP 16, Lens cedex, France
fu@cril.univ-artois.fr

2 Motivation
- Develop data mining tools based on formal concept analysis
  - Concept lattices are an effective tool for data analysis
  - Develop scalable concept lattice algorithms to reduce the cost of mining very large data
- Problem: very large data is still hard to handle
  - Situation: applications of concept lattices suffer from the complexity of concept lattice algorithms on large datasets
  - One solution: divide-and-conquer

3 Outline
1 Introduction
2 Concept lattice
3 Mining frequent closed itemsets
4 PFC algorithm
5 Conclusion

4 ❶ Introduction

5 ❶ Introduction: Concept lattice
- The core of formal concept analysis
- Derived from mathematical order theory and lattice theory
- Study of the generation and relation of formal concepts:
  - how objects can be hierarchically grouped together according to their common attributes
  - how to generate knowledge rules from the concept lattice
- Data → Concept lattice → Knowledge

6 ❶ Introduction: An effective tool
Concept lattice is an effective tool for:
- Data analysis
- Knowledge discovery and data mining
- Information retrieval
- Software engineering
- ...

7 ❶ Introduction: Concept lattice for Data mining (DM)
- Association rules mining (ARM)
  - Frequent Closed Itemset (FCI) mining
    - CLOSE, A-close, Titanic...
    - CLOSET, CHARM, CLOSET+...
- Supervised classification
  - GRAND, LEGAL, GALOIS
  - RULEARNER, CIBLe, CLNN&CLNB...
- Clustering analysis based on concept lattices
- Data visualization
- ...

8 ❶ Introduction: Concept lattice algorithms
- Many concept lattice algorithms: Chein (1969), Norris (1978), Ganter (NextClosure) (1984), Bordat (1986), Godin (1991, 1995), Dowling (1993), Kuznetsov (1993), Carpineto and Romano (1993, 1996), Nourine (1999), Lindig (2000), Valtchev (2002)...
- However, large data impedes or limits the application of concept lattice algorithms
- We need to develop scalable lattice-based algorithms for data mining

9 ❷ Concept lattice

10 ❷ Concept lattice: Formal context
Definition: A formal context is defined by a triple $(O, A, R)$, where $O$ and $A$ are two sets and $R$ is a relation between $O$ and $A$. The elements of $O$ are called objects or transactions, while the elements of $A$ are called items or attributes.
Example: Formal context $D$: $(O, A, R)$ with $O = \{1, 2, 3, 4, 5, 6, 7, 8\}$ and $A = \{a_1, a_2, a_3, a_4, a_5, a_6, a_7, a_8\}$.
[Table: cross table of $D$, objects 1 to 8 against items $a_1$ to $a_8$]

11 ❷ Concept lattice: Formal concept
Definition: Two closure operators are defined as $O_1 \mapsto O_1''$ for set $O$ and $A_1 \mapsto A_1''$ for set $A$, where
$O_1' := \{a \in A \mid oRa \text{ for all } o \in O_1\}$
$A_1' := \{o \in O \mid oRa \text{ for all } a \in A_1\}$
Definition: A formal concept of $(O, A, R)$ is a pair $(O_1, A_1)$ with $O_1 \subseteq O$, $A_1 \subseteq A$, $O_1 = A_1'$ and $A_1 = O_1'$. $O_1$ is called the extent, $A_1$ is called the intent.
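The two derivation operators and the formal concept test can be sketched in a few lines of Python. The context below is a hypothetical three-object toy example, not the deck's context D, and the names `CONTEXT`, `common_items`, `common_objects` and `is_formal_concept` are illustrative assumptions.

```python
# Hypothetical toy context (NOT the deck's context D): object -> set of items.
CONTEXT = {
    1: {"a", "b"},
    2: {"a", "c"},
    3: {"a", "b", "c"},
}
ITEMS = {"a", "b", "c"}


def common_items(objects):
    """O1': the items shared by every object in `objects`."""
    result = set(ITEMS)
    for obj in objects:
        result &= CONTEXT[obj]
    return result


def common_objects(items):
    """A1': the objects that have every item in `items`."""
    return {obj for obj, row in CONTEXT.items() if set(items) <= row}


def is_formal_concept(objects, items):
    """(O1, A1) is a formal concept iff O1 = A1' and A1 = O1'."""
    return (common_objects(items) == set(objects)
            and common_items(objects) == set(items))


print(common_objects({"b"}))                   # extent of the itemset {b}
print(is_formal_concept({1, 3}, {"a", "b"}))   # both conditions hold
print(is_formal_concept({1, 3}, {"b"}))        # {b} is not the full intent
```

In this toy context the pair ({1, 3}, {a, b}) is a concept, while ({1, 3}, {b}) is not, because every object containing b also contains a.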

12 ❷ Concept lattice: Concept lattice
Definition: We say that there is a hierarchical order between two formal concepts $(O_1, A_1)$ and $(O_2, A_2)$ if $O_1 \subseteq O_2$ (or $A_2 \subseteq A_1$). All formal concepts with the hierarchical order of concepts form a complete lattice called the concept lattice.
[Figure: concept lattice of context $D$; each node is a formal concept labeled with its extent and intent, from the top intent $a_1$ down to the bottom intent $a_1a_2a_3a_4a_5a_6a_7a_8$]

14 ❸ Mining frequent closed itemsets

15 ❸ Mining frequent closed itemsets: Association rule mining
- Finding frequent patterns, associations, correlations, or causal structures among itemsets or objects in databases
- Two sub-problems:
  - mining frequent itemsets
  - generating association rules from frequent itemsets
- Basic concepts: given a formal context $(O, A, R)$ and an itemset $A_1 \subseteq A$
  - support: $support(A_1) = |A_1'|$
  - frequent itemset: $A_1$ is frequent if $support(A_1) \ge minsupport$, given a minimum support threshold $minsupport$
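As a small illustration of the support definition, the sketch below counts the objects containing an itemset and compares the count with a threshold. The transactions are a hypothetical toy example, and the names `TRANSACTIONS`, `support` and `is_frequent` are assumptions made for this sketch.

```python
# Hypothetical toy transactions (NOT the deck's context D): object -> set of items.
TRANSACTIONS = {
    1: {"a", "b"},
    2: {"a", "c"},
    3: {"a", "b", "c"},
    4: {"b"},
}


def support(itemset):
    """Absolute support: the number of objects that contain every item of `itemset`."""
    return sum(1 for row in TRANSACTIONS.values() if set(itemset) <= row)


def is_frequent(itemset, minsupport):
    """An itemset is frequent when its support reaches the threshold."""
    return support(itemset) >= minsupport


print(support({"a"}), support({"a", "b"}), support({"b", "c"}))
print(is_frequent({"a", "b"}, minsupport=2))
```

Support is taken here as an absolute count; a relative support would divide by the number of objects.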

16 ❸ Mining frequent closed itemsets: Algorithms of mining frequent itemsets
- Classic algorithm: Apriori and OCD [Agrawal et al., 96]
- Improvements of Apriori:
  - AprioriTid, AprioriHybrid
  - DHP (Direct Hashing and Pruning)
  - Partition
  - Sampling
  - DIC (Dynamic Itemset Counting)...
- Other strategies: FP-Growth...
- Problem: too many frequent itemsets if the minsupport is low
- Solutions:
  - Maximal frequent itemset mining
  - Frequent closed itemset mining

17 ❸ Mining frequent closed itemsets: Closed itemset lattice
Definition: For a formal context $(O, A, R)$, an itemset $A_1 \subseteq A$ is a closed itemset iff $A_1 = A_1''$.
Definition: We say that there is a hierarchical order between closed itemsets $A_1$ and $A_2$ if $A_1 \subseteq A_2$. All closed itemsets with the hierarchical order form a complete lattice called the closed itemset lattice.
[Figure: closed itemset lattice of context $D$, from the top itemset $a_1$ down to the bottom itemset $a_1a_2a_3a_4a_5a_6a_7a_8$]
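To make the closedness condition concrete, the following brute-force sketch computes the closure of every subset of items and collects the distinct results. It is only meant for tiny contexts; the data and the names `CONTEXT`, `closure` and `closed_itemsets` are hypothetical, not taken from the deck.

```python
from itertools import combinations

# Hypothetical toy context (NOT the deck's context D): object -> set of items.
CONTEXT = {
    1: {"a", "b"},
    2: {"a", "c"},
    3: {"a", "b", "c"},
}
ITEMS = ["a", "b", "c"]


def closure(itemset):
    """A1'': intersect the rows of all objects that contain `itemset`."""
    result = set(ITEMS)  # closure of an itemset with an empty extent is all of A
    for row in CONTEXT.values():
        if set(itemset) <= row:
            result &= row
    return frozenset(result)


def closed_itemsets():
    """Brute force: an itemset is closed iff it equals its own closure."""
    found = set()
    for size in range(len(ITEMS) + 1):
        for combo in combinations(ITEMS, size):
            found.add(closure(combo))
    return found


for itemset in sorted(closed_itemsets(), key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset))
```

The exponential enumeration is exactly what the algorithms listed in this section try to avoid; the sketch only illustrates the definition.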

18 ❸ Mining frequent closed itemsets: FCI mining algorithms
- Apriori-like algorithms: Close, A-Close...
- Hybrid algorithms: CHARM, CLOSET, CLOSET+, FP-Close, DCI-Closed, LCM...

19 ❹ PFC algorithm

20 ❹ PFC algorithm: Analysis of search space
Search space of a formal context with 4 attributes $a_{m-3}, a_{m-2}, a_{m-1}, a_m$:
[Figure: all non-empty subsets of the four attributes, from the single items up to $a_{m-3}a_{m-2}a_{m-1}a_m$]
Search space decomposition:
- $A_m$: $a_m$
- $A_{m-1}$: $a_{m-1}$, $a_{m-1}a_m$
- $A_{m-2}$: $a_{m-2}$, $a_{m-2}a_{m-1}$, $a_{m-2}a_m$, $a_{m-2}a_{m-1}a_m$
- $A_{m-3}$: $a_{m-3}$, $a_{m-3}a_{m-2}$, $a_{m-3}a_{m-1}$, $a_{m-3}a_m$, $a_{m-3}a_{m-2}a_{m-1}$, $a_{m-3}a_{m-2}a_m$, $a_{m-3}a_{m-1}a_m$, $a_{m-3}a_{m-2}a_{m-1}a_m$

23 ❹ PFC algorithm: Analysis of search space
Definition: Given an attribute $a_i \in A$ of the context $(O, A, R)$ and a set $E$ with $a_i \notin E$, we define $a_i \otimes E = \{\{a_i\} \cup X \text{ for all } X \in E\}$.
$A_k = \{a_k\} \cup \{\{a_k\} \cup X_i\},\ X_i \in A_j,\ k+1 \le j \le m$
$A_k = a_k \otimes \{a_{k+1}, a_{k+2}, \dots, a_m\}$
Problems:
- how to balance the size of the partitions
- whether each partition contains closed itemsets
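The decomposition can be checked mechanically: the sub-space of an item contains every subset of that item and the later items that includes the item itself. The sketch below is one way to enumerate it, assuming a 0-based position `k` in an ordered list; the function name `subspace` and the item labels are illustrative, not from the deck.

```python
from itertools import combinations


def subspace(items, k):
    """Sub-space A_k (0-based k): all subsets of items[k:] that contain items[k]."""
    head, tail = items[k], items[k + 1:]
    result = []
    for size in range(len(tail) + 1):
        for combo in combinations(tail, size):
            result.append(frozenset((head,) + combo))
    return result


# Four attributes, listed in the order a_{m-3}, a_{m-2}, a_{m-1}, a_m.
items = ["a_m-3", "a_m-2", "a_m-1", "a_m"]
sizes = [len(subspace(items, k)) for k in range(len(items))]
print(sizes)  # sub-space sizes halve from the first item to the last

# The sub-spaces are pairwise disjoint and together cover every non-empty subset.
everything = [s for k in range(len(items)) for s in subspace(items, k)]
print(len(everything), len(set(everything)))
```

The halving sizes show why equal-width groups of sub-spaces are badly balanced, which is the first problem raised on the slide.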

24 ❹ PFC algorithm: A strategy of decomposition
Partitions (for the preceding example: formal context $D$):
- $partition_1 = A_8 \cup A_7 \cup A_6 \cup A_5$
- $partition_2 = A_4 \cup A_3$
- $partition_3 = A_2$
- $partition_4 = A_1$
Results: not good! Number of closed itemsets per partition:
- $Nb_c(partition_1) = 0$, $Nb_c(partition_2) = 0$, $Nb_c(partition_3) = 0$, $Nb_c(partition_4) = 17$
A solution: order the attributes. The closed itemsets are then spread over the partitions:
- $Nb_c(partition_1) = 6$, $Nb_c(partition_2) = 6$, $Nb_c(partition_3) = 4$, $Nb_c(partition_4) = 1$

25 ❹ PFC algorithm: Ordered data context
Definition: A formal context is called an ordered data context if its items are ordered by the number of objects of each item, from the smallest to the biggest, and items that have the same set of objects are merged into one item.
Example: [Tables: formal context $D$ with columns $a_1, a_2, a_3, a_4, a_5, a_6, a_7, a_8$ ⇒ ordered data context with columns $a_5, a_8, a_6, a_7, a_4, a_3, a_2, a_1$]
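The ordering step can be sketched as follows: compute the object set of each item, merge items whose object sets are identical, then sort by object count. The toy context and the name `ordered_data_context` are hypothetical, and breaking ties by item name is an assumption that the slide does not specify.

```python
def ordered_data_context(context):
    """Return groups of merged items, ordered by increasing number of objects.

    `context` maps each object to its set of items. Items with the same
    object set end up in the same tuple. Ties are broken by item name
    (an assumption made for this sketch).
    """
    extents = {}
    for obj, row in context.items():
        for item in row:
            extents.setdefault(item, set()).add(obj)

    merged = {}
    for item in sorted(extents):
        merged.setdefault(frozenset(extents[item]), []).append(item)

    groups = sorted(merged.items(), key=lambda kv: (len(kv[0]), kv[1]))
    return [tuple(names) for _, names in groups]


# Hypothetical toy context (NOT the deck's context D).
toy = {
    1: {"a", "b", "d"},
    2: {"a", "c"},
    3: {"a", "b", "c", "d"},
}
print(ordered_data_context(toy))
```

Here b and d occur in exactly the same objects, so they are merged, and the most frequent item a comes last, mirroring the position of $a_1$ in the deck's ordered context.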

26 ❹ PFC algorithm: Folding search sub-space
Definition: Given an item $a_i$ of a formal context $(O, A, R)$, all subsets of $\{a_i, a_{i+1}, \dots, a_{m-1}, a_m\}$ that include $a_i$ form a search sub-space (for closed itemsets) that is called the folding search sub-space (F3S) of $a_i$, denoted $F3S_i$.
Remark: The search space of closed itemsets is $F3S_m \cup F3S_{m-1} \cup F3S_{m-2} \cup F3S_{m-3} \cup \dots \cup F3S_i \cup \dots \cup F3S_1$.

27 ❹ PFC algorithm: An example
step 1: order the context $(O, A, R)$ to obtain the ordered data context: $a_5\ a_8\ a_6\ a_7\ a_4\ a_3\ a_2\ a_1$
step 2: $DP = 0.5 \Rightarrow p_1 = 8,\ p_2 = 4,\ p_3 = 2,\ p_4 = 1 \Rightarrow p_1 \to a_1,\ p_2 \to a_7,\ p_3 \to a_8,\ p_4 \to a_5$ (each $p_k$ counts the items of the ordered context from $a_5$ up to the chosen item)
step 3: partitions:
- choose $a_4\ a_3\ a_2\ a_1$, remain $a_5\ a_8\ a_6\ a_7$: $partition_1 = [a_1, a_7)$: $a_1 \dots a_4$
- choose $a_6\ a_7$, remain $a_5\ a_8$: $partition_2 = [a_7, a_8)$: $a_7\ a_6$
- choose $a_8$, remain $a_5$: $partition_3 = [a_8, a_5)$: $a_8$
- choose $a_5$, nothing remains: $partition_4 = [a_5)$: $a_5$
Search the closed itemsets in each partition separately:
- $[a_1, a_7)$: $a_1$, $a_2a_1$, $a_3a_1$, $a_3a_2a_1$, $a_4a_1$, $a_4a_3a_1$
- $[a_7, a_8)$: $a_7a_1$, $a_7a_2a_1$, $a_6a_4a_1$, $a_6a_4a_2a_1$, $a_6a_4a_3a_1$, $a_6a_4a_3a_2a_1$
- $[a_8, a_5)$: $a_8a_7a_1$, $a_8a_7a_2a_1$, $a_8a_7a_3a_1$, $a_8a_7a_3a_2a_1$
- $[a_5)$: $a_5a_4a_3a_1$

32 ❹ PFC algorithm: The principle of PFC
PFC algorithm has two steps:
1) Determining the partitions
2) Generating independently all frequent closed itemsets of each partition
Failure → divide-and-conquer → Success

33 ❹ PFC algorithm: Determining the partitions
- The context $(O, A, R)$ is transformed into its ordered data context, where $A = \{a_1, a_2, \dots, a_i, \dots, a_m\}$
- Some items of $A$ are chosen to form an ordered set $P$:
  $P = \{a_{P_T}, a_{P_{T-1}}, \dots, a_{P_k}, \dots, a_{P_1}\}$, $|P| = T$, $a_{P_k} \in A$
  $a_{P_T} < a_{P_{T-1}} < \dots < a_{P_k} < \dots < a_{P_2} < a_{P_1} = a_m$
- $DP$ is used to choose $a_{P_k}$ ($0 < DP < 1$):
  $DP = \dfrac{|\{a_1, \dots, a_{P_k}\}|}{|\{a_1, \dots, a_{P_{k-1}}\}|}$
- The partitions: $[a_{P_k}, a_{P_{k+1}})$ and $[a_{P_T})$
- Interval $[a_{P_k}, a_{P_{k+1}})$ is the search space from item $a_{P_k}$ (included) to $a_{P_{k+1}}$ (excluded):
  $[a_{P_k}, a_{P_{k+1}}) = \bigcup_{P_{k+1} < i \le P_k} F3S_i$
  $[a_{P_T}) = F3S_{P_T}$
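A minimal sketch of the partitioning step, checked against the deck's example with $DP = 0.5$ and the ordered items $a_5\ a_8\ a_6\ a_7\ a_4\ a_3\ a_2\ a_1$. The function names are hypothetical, and the rounding rule (truncate, then clamp so that each position strictly decreases) is an assumption; the slides do not say how non-integer positions are handled.

```python
def partition_positions(m, dp):
    """1-based positions P_1 = m > P_2 > ... > P_T = 1 with P_{k+1} ~ dp * P_k.

    Truncation and clamping are assumptions of this sketch.
    """
    positions = [m]
    while positions[-1] > 1:
        nxt = int(positions[-1] * dp)
        nxt = max(1, min(nxt, positions[-1] - 1))
        positions.append(nxt)
    return positions


def partitions(ordered_items, dp):
    """Split the ordered items into the intervals [a_Pk, a_Pk+1) and [a_PT)."""
    pos = partition_positions(len(ordered_items), dp)
    parts = []
    for k, p in enumerate(pos):
        lower = pos[k + 1] if k + 1 < len(pos) else 0
        # Items at 1-based positions p, p-1, ..., lower+1.
        parts.append([ordered_items[i - 1] for i in range(p, lower, -1)])
    return parts


ordered = ["a5", "a8", "a6", "a7", "a4", "a3", "a2", "a1"]
print(partition_positions(len(ordered), 0.5))
for part in partitions(ordered, 0.5):
    print(part)
```

Each returned list names the items whose folding search sub-spaces make up one partition; a smaller `dp` yields fewer and more uneven partitions.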

34 ❹ PFC algorithm: Generating FCIs in each partition
Definition: Given an itemset $A_1 \subseteq A$, $A_1 = \{b_1, b_2, \dots, b_i, \dots, b_k\}$, $b_i \in A$, where $A_1$ is an infrequent itemset. The candidate of the next closed itemset of $A_1$, noted $CA_1$, is $A_1 \oplus a_i = (A_1 \cap (a_1, a_2, \dots, a_{i-1}) \cup \{a_i\})$, where $a_i < b_k$ and $a_i \notin A_1$, and $a_i$ is the biggest item of $A$ with $A_1 < A_1 \oplus a_i$ following the order $a_1 < \dots < a_i < \dots < a_m$.
- Partitions are independent and non-overlapping
- For each partition, starting from an itemset $A_1$:
  - If $|A_1'| \ge minsupport$, we search the next closure of $A_1$
  - If $|A_1'| < minsupport$, we search $CA_1$. The closed itemsets between $A_1$ and $CA_1$ are ignored
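For orientation, here is a sketch of the classical NextClosure enumeration (Ganter's algorithm, listed among the concept lattice algorithms in the introduction) followed by a plain support filter. It is not the PFC procedure: it visits every closed itemset and does not implement the jump to the candidate $CA_1$ for infrequent itemsets. The toy context and all names are hypothetical.

```python
# Hypothetical toy context (NOT the deck's context D): object -> set of items.
CONTEXT = {
    1: {"a", "b"},
    2: {"a", "c"},
    3: {"a", "b", "c"},
}
ITEMS = ["a", "b", "c"]  # fixed linear order a < b < c


def extent(itemset):
    """Objects that contain every item of `itemset`."""
    return {obj for obj, row in CONTEXT.items() if itemset <= row}


def closure(itemset):
    """Items shared by all objects of the extent (all items if the extent is empty)."""
    result = set(ITEMS)
    for obj in extent(itemset):
        result &= CONTEXT[obj]
    return frozenset(result)


def next_closure(current):
    """Next closed itemset after `current` in lectic order, or None at the end."""
    a = set(current)
    for i in reversed(range(len(ITEMS))):
        item = ITEMS[i]
        if item in a:
            a.discard(item)
        else:
            b = closure(a | {item})
            # Accept b only if it adds no item smaller than `item`.
            if not any(x in ITEMS[:i] for x in b - a):
                return b
    return None


def frequent_closed_itemsets(minsupport):
    """Enumerate all closed itemsets, keeping those whose support is large enough."""
    found = []
    current = closure(set())
    while current is not None:
        if len(extent(current)) >= minsupport:
            found.append(current)
        current = next_closure(current)
    return found


print([sorted(s) for s in frequent_closed_itemsets(minsupport=2)])
```

The pruning described on the slide would replace the unconditional call to `next_closure` for infrequent itemsets, so that the closed itemsets between $A_1$ and $CA_1$ are never generated.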

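35 ❹ PFC algorithm: An example
Search FCIs with PFC
[Figure: search path through the itemsets that start with $a_{m-3}$: $a_{m-3}$, $a_{m-3}a_{m-2}$, $a_{m-3}a_{m-1}$, $a_{m-3}a_m$, $a_{m-3}a_{m-2}a_{m-1}$, $a_{m-3}a_{m-2}a_m$, $a_{m-3}a_{m-1}a_m$, $a_{m-3}a_{m-2}a_{m-1}a_m$, followed by $a_{m-4}$]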

36 ❹ PFC algorithm: Theoretical comparison
Extended comparison of [BenYahia and Mephu, 04]

Algorithm   Format       Technique                  Architecture   Parallelism
A-CLOSE     horizontal   Generate-and-test          Sequential     no
CLOSET+     horizontal   Hybrid, FP-tree            Sequential     no
CHARM       vertical     Hybrid, IT-tree, diffset   Sequential     no
PFC         relational   Hybrid, partition          Distributed    yes

37 ❹ PFC algorithm: Experimentation
- Many FCI algorithms were compared in [FIMI03]
- To get a reference point for the performance of PFC:
  - C++ for CLOSET+ vs Java for PFC
  - Sequential architecture
- Result table columns: Data, Objects, Items, Minsupport, FCI, PFC (msec), CLOSET+, Ref.
- Datasets: audiology, soybean-small, lung-cancer, promoters, soybean-large, dermatology, breast-cancer-wis, kr-vs-kp, agaricus-lepiota, connect
- Discussion: 3 contexts with good results; density, or $|A| \gg |O|$

38 ❺ Conclusion

39 ❺ Conclusion: Conclusion
PFC: a scalable algorithm
- The PFC algorithm can be used to generate FCIs
- Preliminary performance results show that PFC is suitable for large data
- One limitation of PFC: the formal context must be loadable into memory
Perspectives
- Analyze context characteristics
- Optimize the parameters of the algorithm
- Parallel and distributed environment
- Application
