1 Frequent Itemset Mining
Nadjib LAZAAR, LIRMM-UM, COCONUT Team
IMAGINA 16/17
Webpage: http://www.lirmm.fr/~lazaar/teaching.html
Email: lazaar@lirmm.fr
2 Data Mining
- Data Mining (DM), or Knowledge Discovery in Databases (KDD), revolves around the investigation and creation of knowledge, processes, algorithms, and mechanisms for retrieving potential knowledge from data collections.
3 Game Data Mining
- Data about player behavior, server performance, system functionality.
- How do we convert these data into something meaningful?
- How do we move from raw data to actionable insights?
⇒ Game data mining is the answer.
www.kdnuggets.com/2010/08/video-tutorial-christian-thurau-data-mining-in-games.html
4 Frequent Itemset Mining: Motivations
Frequent Itemset Mining is a method for market basket analysis. It aims at finding regularities in the shopping behavior of customers of supermarkets, mail-order companies, on-line shops, etc.
More specifically: find sets of products that are frequently bought together.
Possible applications of the found frequent itemsets:
- Improve the arrangement of products on shelves, on a catalog's pages, etc.
- Support cross-selling (suggestion of other products) and product bundling.
- Fraud detection, technical dependence analysis, fault localization, etc.
Often the found patterns are expressed as association rules, for example:
- If a customer buys bread and wine, then she/he will probably also buy cheese.
5 Frequent Itemset Mining: Basic notions
- Items: I = {i_1, ..., i_n}
- Itemset: P ⊆ I
- Transactions: T = {t_1, ..., t_m}
- Language of itemsets: L_I = 2^I
- Transactional dataset: D over the items I and the transactions T
- Cover of an itemset: S(P) = {t_i ∈ T | P ⊆ t_i}
- (Absolute) frequency: freq(P) = |S(P)|
6 Absolute/relative frequency
- Absolute frequency: freq(P) = |S(P)|
- Relative frequency: freq(P) = (1/m) |S(P)|
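The cover and the two frequency measures can be sketched in a few lines of Python. The dataset below is a hypothetical toy example (the slides' own 10-transaction table is not reproduced here); `cover`, `freq` and `rel_freq` follow the definitions above.

```python
# Hypothetical toy dataset: transaction id -> set of items.
D = {1: {"a", "d", "e"}, 2: {"b", "c", "d"}, 3: {"a", "c", "e"},
     4: {"a", "c", "d", "e"}, 5: {"a", "e"}, 6: {"b", "c"}}

def cover(P, D):
    """S(P): the ids of the transactions containing every item of P."""
    return {t for t, items in D.items() if set(P) <= items}

def freq(P, D):
    """Absolute frequency: |S(P)|."""
    return len(cover(P, D))

def rel_freq(P, D):
    """Relative frequency: |S(P)| / m, where m is the number of transactions."""
    return freq(P, D) / len(D)
```

On this toy data, `cover({"b", "c"}, D)` is `{2, 6}`, so the absolute frequency of bc is 2 and its relative frequency is 2/6.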
7 Frequent Itemset Mining: Definition
Given:
- a set of items I = {i_1, ..., i_n}
- a set of transactions T = {t_1, ..., t_m} over the items
- a minimum support θ
Find: the set of itemsets P such that freq(P) ≥ θ.
8 Example (1)
I = {a, b, c, d, e}, T = {1, ..., 10}
[Transaction table of the example dataset]
S(bc) = {2, 7, 9}, freq(bc) = 3
Q: Is this an absolute or a relative frequency?
9 Example (1)
[Itemset lattice over {a, b, c, d, e}, rooted at ∅]
10 Example (1)
[The same lattice, annotated with the frequency of each itemset]
11 Example (1)
Frequent itemsets?
[Itemset lattice annotated with frequencies]
12 Example (1)
Frequent itemsets with minimum support θ = 3?
[Itemset lattice restricted to the frequent itemsets]
13 Searching for Frequent Itemsets
- A naïve search that enumerates and tests the frequency of all itemset candidates in a given dataset is usually infeasible.
- Why?

Number of items (n) | Search space (2^n)
10                  | ~10^3
20                  | ~10^6
30                  | ~10^9
100                 | ~10^30
128                 | ~10^38
1000                | ~10^301 (far more than the ~10^80 atoms in the universe)
14 Anti-monotonicity property
- Given a transaction database D over items I and two itemsets X, Y:
X ⊆ Y ⇒ S(Y) ⊆ S(X)
- That is: X ⊆ Y ⇒ freq(Y) ≤ freq(X)
15 Example (2)
S(ade) = {1, 4, 8, 10}, freq(ade) = 4
S(acde) = {4, 8}, freq(acde) = 2
[Itemset lattice with ade and its superset acde highlighted]
16 Apriori property
- Given a transaction database D over items I, a minimum support θ, and two itemsets X, Y:
X ⊆ Y ⇒ freq(Y) ≤ freq(X)
- It follows: X ⊆ Y ⇒ (freq(Y) ≥ θ ⇒ freq(X) ≥ θ)
All subsets of a frequent itemset are frequent!
- Contraposition: X ⊆ Y ⇒ (freq(X) < θ ⇒ freq(Y) < θ)
All supersets of an infrequent itemset are infrequent!
17 Example (3)
All subsets of a frequent itemset are frequent! (θ = 2)
[Itemset lattice with acde and its subsets a, c, d, e, ac, ad, ae, cd, ce, de, acd, ace, ade, cde highlighted]
18 Example (3)
All supersets of an infrequent itemset are infrequent! (θ = 2)
[Itemset lattice with be and its supersets abe, bce, bde, abce, abde, bcde, abcde highlighted]
19 Partially ordered sets
A partial order is a binary relation R over a set S such that for all x, y, z ∈ S:
- x R x (reflexivity)
- x R y ∧ y R x ⇒ x = y (anti-symmetry)
- x R y ∧ y R z ⇒ x R z (transitivity)
Q: For the itemset lattice rooted at ∅: S = ? R = ?
20 Poset (2^I, ⊆)
- Comparable itemsets: x ⊆ y ∨ y ⊆ x
- Incomparable itemsets: x ⊈ y ∧ y ⊈ x
21 Apriori Algorithm [Agrawal and Srikant 1994]
- Determine the support of the one-element itemsets (i.e. singletons) and discard the infrequent items.
- Form candidate itemsets with two items (both items must be frequent), determine their support, and discard the infrequent itemsets.
- Form candidate itemsets with three items (all contained pairs must be frequent), determine their support, and discard the infrequent itemsets.
- And so on!
Based on candidate generation and pruning.
22 Apriori Algorithm [Agrawal and Srikant 1994]
[Pseudocode of the level-wise algorithm]
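The level-wise scheme above can be sketched in Python. This is a minimal illustration, not the paper's optimized implementation; the dataset is a hypothetical toy example represented as a dict from transaction id to item set.

```python
from itertools import combinations

# Hypothetical toy dataset: transaction id -> set of items.
D = {1: {"a", "d", "e"}, 2: {"b", "c", "d"}, 3: {"a", "c", "e"},
     4: {"a", "c", "d", "e"}, 5: {"a", "e"}, 6: {"b", "c"}}

def apriori(D, theta):
    """Level-wise search: returns {itemset: absolute support} for every
    itemset with support >= theta."""
    def support(P):
        return sum(1 for items in D.values() if P <= items)

    # Level 1: frequent singletons.
    level = set()
    for i in {i for items in D.values() for i in items}:
        P = frozenset([i])
        if support(P) >= theta:
            level.add(P)
    frequent = {P: support(P) for P in level}
    k = 1
    while level:
        # Candidate generation: join two frequent k-itemsets into a (k+1)-set.
        candidates = {P | Q for P in level for Q in level if len(P | Q) == k + 1}
        # Apriori pruning: every k-subset of a candidate must be frequent.
        candidates = {C for C in candidates
                      if all(frozenset(S) in frequent for S in combinations(C, k))}
        level = {C for C in candidates if support(C) >= theta}
        frequent.update((C, support(C)) for C in level)
        k += 1
    return frequent
```

The pruning comprehension is exactly the Apriori property of slide 16: a candidate survives only if all of its k-subsets were found frequent at the previous level.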
23 Apriori candidates generation
apriori_gen(L_k):
  E ← ∅
  for all P', P'' ∈ L_k s.t. (P' = {i_1, ..., i_{k-1}, i_k}) ∧ (P'' = {i_1, ..., i_{k-1}, i'_k}):
    P ← P' ∪ P''   // {i_1, ..., i_{k-1}, i_k, i'_k}
    if ∀ i ∈ P : P \ {i} ∈ L_k then E ← E ∪ {P}
  return E
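A direct Python transcription of apriori_gen, assuming the items are totally ordered (here: string order), so two k-itemsets are joined exactly when they share their first k-1 items:

```python
def apriori_gen(Lk):
    """Join + prune step. Lk: a set of frozensets, all of size k.
    Join two k-itemsets sharing their (k-1)-prefix in sorted order,
    then keep the (k+1)-candidate only if all its k-subsets are in Lk."""
    Lk_set = set(Lk)
    tuples = sorted(tuple(sorted(P)) for P in Lk)
    E = set()
    for a in tuples:
        for b in tuples:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:     # join on the common prefix
                P = frozenset(a + (b[-1],))
                if all(P - {i} in Lk_set for i in P):  # prune step
                    E.add(P)
    return E
```

For example, from L_2 = {ac, ad, ae, cd} only acd survives: ace and ade are pruned because ce and de are not in L_2.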
24 Improving candidates generation
- Using the apriori-gen function, an itemset of size k+1 can be generated in j possible ways, with j = k(k+1)/2.
- Need: generate each itemset candidate at most once.
- How: assign to each itemset a unique parent itemset from which it is to be generated.
25 Improving candidates generation
- Assigning unique parents turns the poset (lattice) into a tree:
[Tree-shaped view of the itemset lattice]
26 Canonical form for itemsets
- An itemset can be represented as a word over the alphabet I.
- Q: How many words of 3 items can we have? Of 4 items? Of k items? → k!
- An arbitrary order on the items (e.g., lexicographic order) gives a canonical form: a unique representation of each itemset, breaking the symmetries.
- Lexicographic order on items: abc < acb < bac < bca ...
27 Recursive processing with canonical forms
- For each P of a given level, generate all possible extensions of P by one item such that:
child(P) = {P' : (i ∉ P) ∧ (P' = P ∪ {i}) ∧ (c(P).last < i) ∧ (P' is frequent)}
- For each P' ∈ child(P), process it recursively.
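The recursion can be sketched as a depth-first search in which an itemset is only extended by items greater than its last item (the canonical-form condition) and infrequent extensions are cut off (the Apriori property). The dataset below is a hypothetical toy example:

```python
# Hypothetical toy dataset: transaction id -> set of items.
D = {1: {"a", "d", "e"}, 2: {"b", "c", "d"}, 3: {"a", "c", "e"},
     4: {"a", "c", "d", "e"}, 5: {"a", "e"}, 6: {"b", "c"}}

def enumerate_frequent(D, theta):
    """DFS over canonical extensions: each frequent itemset is generated
    exactly once, from its unique parent (the itemset minus its last item)."""
    def support(P):
        return sum(1 for items in D.values() if P <= items)

    items = sorted({i for items in D.values() for i in items})
    result = {}

    def rec(P, last):
        for i in items:
            if i <= last:      # canonical form: only extend past the last item
                continue
            Q = P | {i}
            s = support(Q)
            if s >= theta:     # anti-monotone pruning: stop below minsup
                result[frozenset(Q)] = s
                rec(Q, i)

    rec(frozenset(), "")       # "" sorts before every item name
    return result
```

On the toy data with θ = 2 this finds the same frequent itemsets as the level-wise search, but never generates the same candidate twice.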
28 Example (4)
Q: What are the children of a given itemset P?
child(P) = {P' : (i ∉ P) ∧ (P' = P ∪ {i}) ∧ (c(P).last < i) ∧ (P' is frequent)}
[Itemset lattice annotated with frequencies]
29 Items Ordering
- Any order can be used; the order is arbitrary.
- However, the search space differs considerably depending on the order.
- Thus, the efficiency of frequent itemset mining algorithms can differ considerably depending on the item order.
- Advanced methods even adapt the order of the items during the search: they use different, but compatible, orders in different branches.
30 Items Ordering (heuristics)
- Frequent itemsets consist of frequent items: sort the items w.r.t. their frequency (decreasing/increasing).
- The sum of the sizes of the transactions containing a given item implicitly captures the frequency of pairs, triplets, etc.: sort the items w.r.t. the sum of the sizes of the transactions that cover them.
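Both heuristics are easy to sketch; the dataset below is a hypothetical toy example, and the secondary sort key on the item name is only there to make ties deterministic:

```python
from collections import Counter

# Hypothetical toy dataset: transaction id -> set of items.
D = {1: {"a", "d", "e"}, 2: {"b", "c", "d"}, 3: {"a", "c", "e"},
     4: {"a", "c", "d", "e"}, 5: {"a", "e"}, 6: {"b", "c"}}

def order_by_frequency(D, decreasing=True):
    """First heuristic: sort items by their frequency."""
    cnt = Counter(i for items in D.values() for i in items)
    return sorted(cnt, key=lambda i: (cnt[i], i), reverse=decreasing)

def order_by_transaction_weight(D, decreasing=True):
    """Second heuristic: sort items by the summed sizes of the
    transactions that cover them."""
    weight = Counter()
    for items in D.values():
        for i in items:
            weight[i] += len(items)
    return sorted(weight, key=lambda i: (weight[i], i), reverse=decreasing)
```

On the toy data, item b is the least frequent and also carries the smallest transaction weight, so both heuristics place it last in decreasing order.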
31 Tutorials link: http://www.lirmm.fr/~lazaar/imagina/td1.pdf
32 Condensed representation of Frequent Itemsets: Closed and Maximal Itemsets
33 Maximal Itemsets
- The set of maximal (frequent) itemsets:
M_θ = {P ∈ 2^I | freq(P) ≥ θ ∧ ∀P' ⊃ P : freq(P') < θ}
- That is: an itemset is maximal if it is frequent, but none of its proper supersets is frequent.
∀θ, ∀P ∈ F_θ : (P ∈ M_θ) ∨ (∃P' ⊃ P : freq(P') ≥ θ)
34 Maximal Itemsets
- Every frequent itemset has a maximal superset:
∀θ, ∀P ∈ F_θ : ∃P' ∈ M_θ : P ⊆ P'
- The maximal itemsets are a condensed representation of the frequent itemsets:
∀θ : F_θ = ⋃_{P ∈ M_θ} 2^P
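Given the output of any frequent-itemset miner as a dict {itemset: support}, the maximal itemsets can be filtered out directly; `P < Q` is Python's proper-subset test, matching the P' ⊃ P condition in the definition.

```python
def maximal_itemsets(frequent):
    """M_theta: the frequent itemsets with no frequent proper superset.
    `frequent` maps frozenset -> support (only the keys are used here)."""
    return {P for P in frequent if not any(P < Q for Q in frequent)}
```

The quadratic scan is fine for illustration; real miners enumerate maximal itemsets directly rather than post-filtering.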
35 Example (5)
Here are the frequent itemsets with minsup θ = 3.
Q: What are the maximal itemsets with minsup θ = 3?
M_θ = {P ∈ 2^I | freq(P) ≥ θ ∧ ∀P' ⊃ P : freq(P') < θ}
[Lattice of the frequent itemsets, with bc, acd, ace, ade marked]
36 Maximal Itemsets (limitation)
The set of maximal itemsets captures the set of all frequent itemsets, BUT it does not preserve the knowledge of all support values.
THE NEED: Can we have a condensed representation of the set of frequent itemsets that preserves the knowledge of all support values?
37 Closed Itemsets
- The set of closed (frequent) itemsets:
C_θ = {P ∈ 2^I | freq(P) ≥ θ ∧ ∀P' ⊃ P : freq(P') < freq(P)}
An itemset is closed if it is frequent, but none of its proper supersets has the same support.
- That is: ∀θ, ∀P ∈ F_θ : (P ∈ C_θ) ∨ (∃P' ⊃ P : freq(P') = freq(P))
38 Closed Itemsets
- Every frequent itemset has a closed superset:
∀θ, ∀P ∈ F_θ : ∃P' ∈ C_θ : P ⊆ P'
- The closed itemsets are a condensed representation of the frequent itemsets:
∀θ : F_θ = ⋃_{P ∈ C_θ} 2^P
39 Closed Itemsets
- Every frequent itemset has a closed superset with the same support.
- The set of all closed itemsets preserves the knowledge of all support values:
∀θ, ∀P ∈ F_θ : |S(P)| = max_{P' ∈ C_θ, P ⊆ P'} |S(P')|
- Which is not the case with the maximal itemsets:
∀θ, ∀P ∈ F_θ : |S(P)| ≥ max_{P' ∈ M_θ, P ⊆ P'} |S(P')|
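The same filtering view works for the closed itemsets, and the max-over-closed-supersets formula above recovers every support. The toy supports used in the usage example below are illustrative, not the slides' dataset.

```python
def closed_itemsets(frequent):
    """C_theta: the frequent itemsets with no proper superset of equal
    support. `frequent` maps frozenset -> support."""
    return {P: s for P, s in frequent.items()
            if not any(P < Q and frequent[Q] == s for Q in frequent)}

def support_from_closed(P, closed):
    """Recover freq(P) as the max support over the closed supersets of P."""
    return max(s for Q, s in closed.items() if P <= Q)
```

For instance, with freq(a) = 4 and freq(ae) = 4, the singleton a is not closed (ae has the same support), yet its support is still recoverable as the maximum over its closed supersets.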
40 Example (6)
Here are the frequent itemsets with minsup θ = 3.
Q: Are b and de closed itemsets?
C_θ = {P ∈ 2^I | freq(P) ≥ θ ∧ ∀P' ⊃ P : freq(P') < freq(P)}
[Lattice of the frequent itemsets, with b and de marked]
41 Maximal/Closed itemsets
[Diagram: the maximal itemsets are a subset of the closed itemsets]
Q: Can the closed itemsets be expressed using the maximal ones?
C_θ = ⋃_{i ∈ {θ, ..., n}} M_i
42 Frequent/Closed/Maximal

Dataset   | #Frequent    | #Closed   | #Maximal
Zoo-1     | 151 807      | 3 292     | 230
Mushroom  | 155 734      | 3 287     | 453
Lymph     | 9 967 402    | 46 802    | 5 191
Hepatitis | > 27 × 10^7  | 1 827 264 | 189 205

https://dtai.cs.kuleuven.be/cp4im/datasets/
43 Tutorials link: http://www.lirmm.fr/~lazaar/imagina/td2.pdf