1 Frequent Itemset Mining. Nadjib LAZAAR, LIRMM-UM, IMAGINA 15/16
2 Frequent Itemset Mining: Motivations
Frequent Itemset Mining is a method for market basket analysis. It aims at finding regularities in the shopping behavior of customers of supermarkets, mail-order companies, on-line shops, etc. More specifically: find sets of products that are frequently bought together.
Possible applications of found frequent itemsets: improve the arrangement of products on shelves, on a catalog's pages, etc.; support cross-selling (suggestion of other products) and product bundling; fraud detection, technical dependence analysis, fault localization, etc.
Often, found patterns are expressed as association rules, for example: if a customer buys bread and wine, then she/he will probably also buy cheese.
3 Frequent Itemset Mining: Basic notions
Items: $I = \{i_1, \dots, i_n\}$
Itemset: $P \subseteq I$
Transactions: $T = \{t_1, \dots, t_m\}$
Language of itemsets: $L_I = 2^I$
Transactional dataset: $D \subseteq I \times T$
Cover of an itemset: $S(P) = \{t_i \in T \mid P \subseteq t_i\}$
(absolute) Frequency: $freq(P) = |S(P)|$
4 Absolute/relative frequency
Absolute Frequency: $freq(P) = |S(P)|$
Relative Frequency: $freq(P) = \frac{1}{m} |S(P)|$
5 Frequent Itemset Mining: Definition
Given:
A set of items $I = \{i_1, \dots, i_n\}$
A set of transactions over the items $T = \{t_1, \dots, t_m\}$
A minimum support $\theta$
The need: the set of itemsets $P$ s.t. $freq(P) \ge \theta$
6 Example
$I = \{a, b, c, d, e\}$, $T = \{1, \dots, 10\}$
(Figure: the dataset shown under the labels H_D, V_D and M_D.)
$S(bc) = \{2, 7, 9\}$, $freq(bc) = 3$
Q: is it an absolute or a relative frequency?
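The cover and the two kinds of frequency can be sketched in a few lines of Python. The dataset below is an assumption: the slide's table is not part of the text, so the ten transactions were chosen to agree with $S(bc) = \{2, 7, 9\}$ and with the other frequencies quoted in the examples. The names `cover` and `freq` are illustrative.

```python
# Assumed dataset (the slide's table is not in the text); it agrees with
# S(bc) = {2, 7, 9} and with the other frequencies quoted in the examples.
D = {tid: set(t) for tid, t in enumerate(
    "ade bcd ace acde ae acd bc acde bce ade".split(), start=1)}

def cover(P, D):
    """S(P): identifiers of the transactions that contain every item of P."""
    return {tid for tid, t in D.items() if set(P) <= t}

def freq(P, D, relative=False):
    """Absolute frequency |S(P)|, or relative frequency |S(P)| / m."""
    count = len(cover(P, D))
    return count / len(D) if relative else count

print(sorted(cover("bc", D)))        # [2, 7, 9]
print(freq("bc", D))                 # 3
print(freq("bc", D, relative=True))  # 0.3
```

The value 3 is a count of transactions, which is the absolute reading; dividing by $m = 10$ gives the relative one.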
7 Example
(Figure: itemset lattice over $I = \{a, b, c, d, e\}$, rooted at ∅.)
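8 Example
Itemset lattice annotated with absolute frequencies, level by level (itemsets in lexicographic order):
Root: ∅
1 item: a: 7, b: 3, c: 7, d: 6, e: 7
2 items: ab: 0, ac: 4, ad: 5, ae: 6, bc: 3, bd: 1, be: 1, cd: 4, ce: 4, de: 4
3 items: abc: 0, abd: 0, abe: 0, acd: 3, ace: 3, ade: 4, bcd: 1, bce: 1, bde: 0, cde: 2
4 items: abcd: 0, abce: 0, abde: 0, acde: 2, bcde: 0
5 items: abcde: 0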
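9 Example
Frequent itemset? The lattice restricted to itemsets with non-zero frequency:
Root: ∅
1 item: a: 7, b: 3, c: 7, d: 6, e: 7
2 items: ac: 4, ad: 5, ae: 6, bc: 3, bd: 1, be: 1, cd: 4, ce: 4, de: 4
3 items: acd: 3, ace: 3, ade: 4, bcd: 1, bce: 1, cde: 2
4 items: acde: 2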
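10 Example
Frequent itemset with minimum support θ=3?
Root: ∅
1 item: a: 7, b: 3, c: 7, d: 6, e: 7
2 items: ac: 4, ad: 5, ae: 6, bc: 3, cd: 4, ce: 4, de: 4
3 items: acd: 3, ace: 3, ade: 4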
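11 Searching for Frequent Itemsets
A naïve search that enumerates every candidate itemset and tests its frequency in the dataset is usually infeasible. Why? The search space grows as $2^n$ with the number of items $n$:
Number of items (n)    Search space (2^n)
10                     ~10^3
20                     ~10^6
30                     ~10^9
100                    ~10^30
128                    ~10^38
1000                   ~10^301
For comparison, the number of atoms in the observable universe is commonly estimated at about 10^80, a size the search space already exceeds for n = 266.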
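A minimal sketch of the naïve search makes the cost visible: every non-empty subset of the items is a candidate, so the loop body runs $2^n - 1$ times. The dataset is an assumed reconstruction that matches the frequencies of the running example, and `naive_mining` is an illustrative name.

```python
from itertools import combinations

# Assumed dataset, consistent with the frequencies of the running example.
D = [set(t) for t in "ade bcd ace acde ae acd bc acde bce ade".split()]

def naive_mining(D, theta):
    """Test every non-empty itemset over the items of D: 2^n - 1 candidates."""
    items = sorted(set().union(*D))
    tested, frequent = 0, {}
    for k in range(1, len(items) + 1):
        for P in combinations(items, k):
            tested += 1
            support = sum(1 for t in D if set(P) <= t)
            if support >= theta:
                frequent["".join(P)] = support
    return tested, frequent

tested, frequent = naive_mining(D, theta=3)
print(tested, len(frequent))      # 31 15

# Order of magnitude of the search space for n = 1000 items.
print(len(str(2 ** 1000)) - 1)    # 301
```

With 5 items the 31 candidates are harmless; the same loop over 100 items would need about $10^{30}$ iterations.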
12 Anti-monotonicity property
Given a transaction database $D$ over items $I$ and two itemsets $X$, $Y$:
$X \subseteq Y \Rightarrow S(Y) \subseteq S(X)$
That is,
$X \subseteq Y \Rightarrow freq(Y) \le freq(X)$
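The property follows directly from the definition of the cover. A short derivation, using only the symbols already introduced ($S$ for the cover, $freq$ for the absolute frequency), could read:

```latex
\begin{align*}
X \subseteq Y,\ t \in S(Y)
  &\;\Rightarrow\; X \subseteq Y \subseteq t
   \;\Rightarrow\; t \in S(X) \\
\text{hence}\quad S(Y) \subseteq S(X)
  &\;\Rightarrow\; freq(Y) = |S(Y)| \le |S(X)| = freq(X)
\end{align*}
```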
13 Example
$S(ade) = \{1, 4, 8, 10\}$, $freq(ade) = 4$
$S(acde) = \{4, 8\}$, $freq(acde) = 2$
(Figure: itemset lattice annotated with frequencies; ade and acde are highlighted.)
14 Apriori property
Given a transaction database $D$ over items $I$, a minsup $\theta$ and two itemsets $X$, $Y$:
$X \subseteq Y \Rightarrow freq(Y) \le freq(X)$
It follows:
$X \subseteq Y \Rightarrow (freq(Y) \ge \theta \Rightarrow freq(X) \ge \theta)$
All subsets of a frequent itemset are frequent!
Contraposition:
$X \subseteq Y \Rightarrow (freq(X) < \theta \Rightarrow freq(Y) < \theta)$
All supersets of an infrequent itemset are infrequent!
15 Example
All subsets of a frequent itemset are frequent! ($\theta = 2$)
(Figure: itemset lattice annotated with frequencies; acde and all its subsets a, c, d, e, ac, ad, ae, cd, ce, de, acd, ace, ade, cde are highlighted.)
16 Example
All supersets of an infrequent itemset are infrequent! ($\theta = 2$)
(Figure: itemset lattice annotated with frequencies; be and all its supersets abe, bce, bde, abce, abde, bcde, abcde are highlighted.)
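17 Partially ordered sets
A partial order is a binary relation $R$ over a set $S$ such that, $\forall x, y, z \in S$:
$x \, R \, x$ (reflexivity)
$x \, R \, y \wedge y \, R \, x \Rightarrow x = y$ (anti-symmetry)
$x \, R \, y \wedge y \, R \, z \Rightarrow x \, R \, z$ (transitivity)
(Figure: itemset lattice rooted at ∅.) Q: $S = ?$ $R = ?$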
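18 Poset $(2^I, \subseteq)$
Comparable itemsets: $x \subseteq y \vee y \subseteq x$
Incomparable itemsets: $x \not\subseteq y \wedge y \not\subseteq x$
(Figure: itemset lattice rooted at ∅.)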
19 Apriori Algorithm [Agrawal and Srikant 1994]
Determine the support of the one-element itemsets (i.e. singletons) and discard the infrequent items.
Form candidate itemsets with two items (both items must be frequent), determine their support, and discard the infrequent itemsets.
Form candidate itemsets with three items (all contained pairs must be frequent), determine their support, and discard the infrequent itemsets.
And so on!
The algorithm is based on candidate generation and pruning.
20 Apriori Algorithm [Agrawal and Srikant 1994]
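21 Apriori candidates generation
apriori-gen($L_k$):
  $E \leftarrow \emptyset$
  $\forall P_i, P_j \in L_k$ s.t. $(P_i = \{i_1, \dots, i_{k-1}, i_k\}) \wedge (P_j = \{i_1, \dots, i_{k-1}, i'_k\}) \wedge (i_k < i'_k)$:
    $P \leftarrow P_i \cup P_j$  // $\{i_1, \dots, i_{k-1}, i_k, i'_k\}$
    if $\forall i \in P : P \setminus \{i\} \in L_k$ then $E \leftarrow E \cup \{P\}$
  return $E$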
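The level-wise procedure and the candidate generation step can be sketched together in Python. This is a minimal illustration, not the authors' implementation: itemsets are represented as sorted tuples, the dataset is an assumed reconstruction consistent with the running example, and `support`, `apriori_gen` and `apriori` are illustrative names.

```python
# Assumed dataset, consistent with the frequencies of the running example.
D = [set(t) for t in "ade bcd ace acde ae acd bc acde bce ade".split()]

def support(P, D):
    """Number of transactions that contain every item of P."""
    return sum(1 for t in D if set(P) <= t)

def apriori_gen(Lk):
    """Join k-itemsets (sorted tuples) sharing their first k-1 items, then prune."""
    Lk, E = set(Lk), set()
    for Pi in Lk:
        for Pj in Lk:
            if Pi[:-1] == Pj[:-1] and Pi[-1] < Pj[-1]:
                P = Pi + (Pj[-1],)
                # Keep P only if every k-subset of P is frequent.
                if all(P[:q] + P[q + 1:] in Lk for q in range(len(P))):
                    E.add(P)
    return E

def apriori(D, theta):
    """Level-wise search: count candidates, keep the frequent ones, generate the next level."""
    items = sorted(set().union(*D))
    level = {(i,) for i in items if support((i,), D) >= theta}
    frequent = {}
    while level:
        frequent.update({P: support(P, D) for P in level})
        level = {P for P in apriori_gen(level) if support(P, D) >= theta}
    return frequent

F = apriori(D, theta=3)
print(len(F), F[("a", "d", "e")])   # 15 4
```

On this data the pruning step matters at level 4: acd and ace join into acde, but its subset cde is not frequent at θ=3, so acde is discarded without counting its support.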
22 Improving candidates generation
Using the apriori-gen function, an itemset of size $k+1$ can be generated in $j$ possible ways: $j = \frac{k(k+1)}{2}$
Need: generate each itemset candidate at most once.
How: assign to each itemset a unique parent itemset, from which this itemset is to be generated.
23 Improving candidates generation
Assigning unique parents turns the poset lattice into a tree.
24 Canonical form for itemsets
An itemset can be represented as a word over an alphabet $I$.
Q: how many words of 3 items can we have? Of 4 items? Of k items? $k!$
An arbitrary order (e.g., lexicographic order) on items gives a canonical form, a unique representation of itemsets that breaks symmetries.
Lex on items: abc < acb < bac < bca ...
25 Recursive processing with Canonical forms
For each $P$ of a given level, generate all possible extensions of $P$ by one item such that:
$child(P) = \{P' : (i \notin P) \wedge (P' = P \cup \{i\}) \wedge (c(P).last < i) \wedge (P' \text{ is frequent})\}$
where $c(P).last$ is the last item of the canonical form of $P$.
For each $P$, process it recursively.
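The recursion can be sketched as follows, with itemsets written as canonical words (strings whose letters are in increasing order). The dataset is an assumed reconstruction consistent with the running example; `children` and `mine` are illustrative names.

```python
# Assumed dataset, consistent with the frequencies of the running example.
D = [set(t) for t in "ade bcd ace acde ae acd bc acde bce ade".split()]
ITEMS = sorted(set().union(*D))

def support(P):
    """Number of transactions that contain every item of the word P."""
    return sum(1 for t in D if set(P) <= t)

def children(P, theta):
    """Frequent extensions of P by one item greater than the last item of P."""
    last = P[-1] if P else ""          # the empty word can be extended by any item
    return [P + i for i in ITEMS if i > last and support(P + i) >= theta]

def mine(P, theta, out):
    """Depth-first traversal: each frequent itemset is reached from its unique parent."""
    for child in children(P, theta):
        out.append(child)
        mine(child, theta, out)
    return out

print(children("a", 3))     # ['ac', 'ad', 'ae']
print(mine("", 3, []))
```

Because an item is only appended when it is greater than the last item of the word, ade is produced from ad and never from ae or de, so no itemset is generated twice.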
26 Example (4)
Q: what are the children of: (Figure: itemset lattice annotated with frequencies.)
$child(P) = \{P' : (i \notin P) \wedge (P' = P \cup \{i\}) \wedge (c(P).last < i) \wedge (P' \text{ is frequent})\}$
27 Items Ordering
Any order can be used, that is, the order is arbitrary.
The search space differs considerably depending on the order.
Thus, the efficiency of Frequent Itemset Mining algorithms can differ considerably depending on the item order.
Advanced methods even adapt the order of the items during the search: they use different, but compatible, orders in different branches.
28 Items Ordering (heuristics)
Frequent itemsets consist of frequent items: sort the items w.r.t. their frequency (decreasing/increasing).
The sum of the sizes of the transactions containing a given item implicitly captures the frequency of pairs, triplets, etc.: sort the items w.r.t. the sum of the sizes of the transactions that cover them.
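Both heuristics reduce to computing one number per item and sorting on it. A small sketch, on an assumed dataset consistent with the running example (the variable names are illustrative):

```python
# Assumed dataset, consistent with the frequencies of the running example.
D = [set(t) for t in "ade bcd ace acde ae acd bc acde bce ade".split()]
items = sorted(set().union(*D))

# Heuristic 1: frequency of each item.
frequency = {i: sum(1 for t in D if i in t) for i in items}

# Heuristic 2: sum of the sizes of the transactions that contain the item.
size_sum = {i: sum(len(t) for t in D if i in t) for i in items}

# Increasing order; ties keep the alphabetical order because sorted() is stable.
by_frequency = sorted(items, key=lambda i: frequency[i])
by_size_sum = sorted(items, key=lambda i: size_sum[i])

print(frequency)      # {'a': 7, 'b': 3, 'c': 7, 'd': 6, 'e': 7}
print(size_sum)       # {'a': 22, 'b': 8, 'c': 22, 'd': 20, 'e': 22}
print(by_frequency)   # ['b', 'd', 'a', 'c', 'e']
```

Passing `reverse=True` to `sorted` gives the decreasing variant. On this small dataset the two heuristics happen to produce the same order.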
29 Condensed representation of Frequent Itemsets: Closed and Maximal Itemsets
30 Maximal Itemsets
The set of Maximal (frequent) Itemsets:
$M_\theta = \{P \subseteq I \mid freq(P) \ge \theta \wedge \forall P' \supset P : freq(P') < \theta\}$
That is: an itemset is maximal if it is frequent, but none of its proper supersets is frequent.
$\forall \theta, \forall P \in F_\theta : (P \in M_\theta) \vee (\exists P' \supset P : freq(P') \ge \theta)$
31 Maximal Itemsets
Every frequent itemset has a maximal superset:
$\forall \theta, \forall P \in F_\theta : (\exists P' \in M_\theta : P \subseteq P')$
The maximal itemsets are a condensed representation of the frequent itemsets, where:
$\forall \theta : F_\theta = \bigcup_{P \in M_\theta} 2^P$
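32 Example (5)
Here are the frequent itemsets with minsup θ=3.
Q: What are the maximal itemsets with minsup θ=3?
(Figure: lattice of the frequent itemsets annotated with frequencies; bc, acd, ace and ade are highlighted.)
$M_\theta = \{P \subseteq I \mid freq(P) \ge \theta \wedge \forall P' \supset P : freq(P') < \theta\}$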
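The definition translates into a filter over the frequent itemsets: keep those with no frequent proper superset. The sketch below enumerates the frequent itemsets naïvely for brevity and uses an assumed dataset consistent with the running example; `frequent_itemsets` and `maximal` are illustrative names.

```python
from itertools import combinations

# Assumed dataset, consistent with the frequencies of the running example.
D = [set(t) for t in "ade bcd ace acde ae acd bc acde bce ade".split()]

def frequent_itemsets(D, theta):
    """All non-empty frequent itemsets with their support (naive enumeration)."""
    items = sorted(set().union(*D))
    F = {}
    for k in range(1, len(items) + 1):
        for P in combinations(items, k):
            s = sum(1 for t in D if set(P) <= t)
            if s >= theta:
                F[frozenset(P)] = s
    return F

def maximal(F):
    """Frequent itemsets with no frequent proper superset (P < Q is strict inclusion)."""
    return {P for P in F if not any(P < Q for Q in F)}

F = frequent_itemsets(D, theta=3)
M = maximal(F)
print(sorted("".join(sorted(P)) for P in M))   # ['acd', 'ace', 'ade', 'bc']

# Every frequent itemset is a subset of some maximal itemset.
print(all(any(P <= Q for Q in M) for P in F))  # True
```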
33 Maximal Itemsets (limitation)
The set of maximal itemsets captures the set of all frequent itemsets, BUT it does not preserve the knowledge of all support values.
THE NEED: can we have a condensed representation of the set of frequent itemsets which preserves the knowledge of all support values?
34 Closed Itemsets
The set of Closed (frequent) Itemsets:
$C_\theta = \{P \subseteq I \mid freq(P) \ge \theta \wedge \forall P' \supset P : freq(P') < freq(P)\}$
That is: an itemset is closed if it is frequent, but none of its proper supersets has the same support.
$\forall \theta, \forall P \in F_\theta : (P \in C_\theta) \vee (\exists P' \supset P : freq(P') = freq(P))$
35 Closed Itemsets
Every frequent itemset has a closed superset:
$\forall \theta, \forall P \in F_\theta : (\exists P' \in C_\theta : P \subseteq P')$
The closed itemsets are a condensed representation of the frequent itemsets, where:
$\forall \theta : F_\theta = \bigcup_{P \in C_\theta} 2^P$
36 Closed Itemsets
Every frequent itemset has a closed superset with the same support.
The set of all closed itemsets preserves the knowledge of all support values:
$\forall \theta, \forall P \in F_\theta : |S(P)| = \max_{P' \in C_\theta, P' \supseteq P} |S(P')|$
Which is not the case with the maximal itemsets:
$\forall \theta, \forall P \in F_\theta : |S(P)| \ge \max_{P' \in M_\theta, P' \supseteq P} |S(P')|$
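37 Example (6)
Here are the frequent itemsets with minsup θ=3.
Q: are b and de closed itemsets?
(Figure: lattice of the frequent itemsets annotated with frequencies; b and de are highlighted.)
$C_\theta = \{P \subseteq I \mid freq(P) \ge \theta \wedge \forall P' \supset P : freq(P') < freq(P)\}$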
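Closedness can be checked mechanically, and the support of any frequent itemset can be recovered from the closed ones alone. The sketch below uses a naïve enumeration and an assumed dataset consistent with the running example; `frequent_itemsets`, `closed_itemsets` and `recover_support` are illustrative names.

```python
from itertools import combinations

# Assumed dataset, consistent with the frequencies of the running example.
D = [set(t) for t in "ade bcd ace acde ae acd bc acde bce ade".split()]

def frequent_itemsets(D, theta):
    """All non-empty frequent itemsets with their support (naive enumeration)."""
    items = sorted(set().union(*D))
    F = {}
    for k in range(1, len(items) + 1):
        for P in combinations(items, k):
            s = sum(1 for t in D if set(P) <= t)
            if s >= theta:
                F[frozenset(P)] = s
    return F

def closed_itemsets(F):
    """Frequent itemsets with no proper superset of equal support.
    A superset with the same support is itself frequent, so searching F suffices."""
    return {P: s for P, s in F.items()
            if not any(P < Q and F[Q] == s for Q in F)}

def recover_support(P, C):
    """Support of a frequent itemset P from the closed itemsets alone."""
    return max(s for Q, s in C.items() if P <= Q)

F = frequent_itemsets(D, theta=3)
C = closed_itemsets(F)
print(frozenset("b") in C, frozenset("de") in C)   # False False
print(recover_support(frozenset("b"), C))          # 3
print(recover_support(frozenset("de"), C))         # 4
```

On this data b is not closed because bc has the same support 3, and de is not closed because ade has the same support 4.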
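38 Maximal/Closed itemsets
(Figure labels: maximal, closed.)
Q: Can the closed itemsets be expressed using the maximal ones?
$C_\theta = \bigcup_{i \in \{\theta, \dots, m\}} M_i$
where $m$ is the number of transactions, the largest possible absolute frequency.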
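The relation between closed and maximal itemsets can be checked numerically: the closed itemsets at threshold θ should coincide with the union of the maximal itemsets over all thresholds from θ up to the number of transactions. The sketch below is a check on an assumed dataset consistent with the running example, not a proof; all function names are illustrative.

```python
from itertools import combinations

# Assumed dataset, consistent with the frequencies of the running example.
D = [set(t) for t in "ade bcd ace acde ae acd bc acde bce ade".split()]

def frequent(theta):
    """All non-empty frequent itemsets with their support (naive enumeration)."""
    items = sorted(set().union(*D))
    F = {}
    for k in range(1, len(items) + 1):
        for P in combinations(items, k):
            s = sum(1 for t in D if set(P) <= t)
            if s >= theta:
                F[frozenset(P)] = s
    return F

def maximal(theta):
    F = frequent(theta)
    return {P for P in F if not any(P < Q for Q in F)}

def closed(theta):
    F = frequent(theta)
    return {P for P, s in F.items() if not any(P < Q and F[Q] == s for Q in F)}

theta, m = 3, len(D)
union_of_maximal = set().union(*(maximal(i) for i in range(theta, m + 1)))
print(union_of_maximal == closed(theta))   # True
print(len(closed(theta)))                  # 13
```

The intuition is that a closed itemset of support $f$ is maximal exactly at threshold $f$: it is frequent there, while all its proper supersets have a strictly smaller support.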
39 Frequent/Closed/Maximal
Dataset      #Frequent     #Closed      #Maximal
Zoo-1        151 807       3 292        230
Mushroom     155 734       3 287        453
Lymph        9 967 402     46 802       5 191
Hepatitis    27·10^7 +     1 827 264    189 205
https://dtai.cs.kuleuven.be/cp4im/datasets/
40 LCM Algorithm Linear Closed Item Set Miner [Uno et al., 03] (version 1) [Uno et al., 04, 05] (versions 2 & 3)
41 LCM: basic ideas
The itemset candidates are checked in lexicographic order (depth-first traversal of the prefix tree).
Step-by-step elimination of items from the transaction database; recursive processing of the conditional transaction databases.
Maintains both a horizontal and a vertical representation of the transaction database in parallel.
Uses the vertical representation to filter the transactions with the chosen split item.
Uses the horizontal representation to fill the vertical representation for the next recursion step (no intersection as in the Eclat algorithm).
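The interplay of the two representations can be illustrated with a small sketch. It is a simplification under stated assumptions, not LCM's actual data structures: the dataset is an assumed reconstruction consistent with the running example, the conditional database only drops the split item itself, and `vertical` and `conditional` are illustrative names.

```python
# Horizontal representation: transaction id -> set of items (assumed dataset).
D = {tid: set(t) for tid, t in enumerate(
    "ade bcd ace acde ae acd bc acde bce ade".split(), start=1)}

def vertical(D):
    """Vertical representation: item -> set of ids of the transactions containing it."""
    V = {}
    for tid, t in D.items():
        for i in t:
            V.setdefault(i, set()).add(tid)
    return V

def conditional(D, V, item):
    """Use the vertical view to select the transactions containing `item`,
    then the horizontal view to build the reduced transactions."""
    return {tid: D[tid] - {item} for tid in V[item]}

V = vertical(D)
print(sorted(V["b"]))                      # [2, 7, 9]

D_b = conditional(D, V, "b")               # conditional database for split item b
V_b = vertical(D_b)                        # refilled by scanning, no tid-list intersection
print({i: sorted(tids) for i, tids in sorted(V_b.items())})
# {'c': [2, 7, 9], 'd': [2], 'e': [9]}
```

The tid-lists of the next recursion step are rebuilt by scanning the selected transactions, which is the point of contrast with Eclat's tid-list intersections.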
42 LCM: basic ideas
The itemset candidates are checked in lexicographic order (depth-first traversal of the prefix tree).
Let $X = \{x_1, \dots, x_n\}$ be an itemset with $x_1 < \dots < x_n$:
$tail(X) = x_n$
$X(i) = \{x_1, \dots, x_i\}$
$X[j] = X \cup \{j\}$
$X$ is a prefix of $Y \Leftrightarrow X = Y(i) \wedge i = tail(X)$
43 LCM: basic ideas
The parent-child relation $P$ is defined as:
$X = P(Y) \Leftrightarrow (Y = X \cup \{x_i\}) \wedge (x_i > tail(X))$
Or equivalently:
$X = P(Y) \Leftrightarrow X = Y \setminus \{tail(Y)\}$
$P$ is an acyclic relation.
44 Example (7)
Are bcd and cda candidates?
tail(abde)=?, tail(a)=?, tail(bd)=?
Let X = abde: X(1)=?, X(2)=?, X(3)=?, X(4)=?, X(5)=X(i: i>3)=?
P(bc)=?, P(ade)=?, P(c)=?
45 Example (7)
bcd is a candidate, cda is not.
tail(abde)=e, tail(a)=a, tail(bd)=d
Let X = abde: X(1)=a, X(2)=ab, X(3)=abd, X(4)=abde, X(5)=X(i: i>3)=X(4)
P(bc)=b, P(ade)=ad, P(c)=∅
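The answers above can be reproduced with a few helper functions. This sketch follows the positional reading of $X(i)$ used in the example, assumes the alphabetical order on items, and writes the empty itemset as the empty string; all function names are illustrative.

```python
def is_candidate(word):
    """A word is a candidate if its items appear in strictly increasing order."""
    return all(x < y for x, y in zip(word, word[1:]))

def tail(X):
    """Largest item of a non-empty itemset."""
    return max(X)

def prefix(X, i):
    """X(i): the first i items of X in increasing order (all of X if i exceeds its size)."""
    return "".join(sorted(X)[:i])

def parent(Y):
    """P(Y): Y without its tail; the empty string stands for the empty itemset."""
    return "".join(sorted(Y)[:-1])

print(is_candidate("bcd"), is_candidate("cda"))        # True False
print(tail("abde"), tail("a"), tail("bd"))             # e a d
print([prefix("abde", i) for i in range(1, 6)])        # ['a', 'ab', 'abd', 'abde', 'abde']
print(parent("bc"), parent("ade"), repr(parent("c")))  # b ad ''
```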
46 Closure Extension [Pasquier et al., 99]
Closure extension is a rule for constructing a closed itemset from another one: add an item and take its closure!
(Figure: closed itemset + item, followed by closure.)
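A minimal sketch of the rule, assuming the usual definition of the closure as the intersection of all transactions that contain the itemset. The dataset is an assumed reconstruction consistent with the running example; `closure` and `closure_extension` are illustrative names.

```python
# Assumed dataset, consistent with the frequencies of the running example.
D = [set(t) for t in "ade bcd ace acde ae acd bc acde bce ade".split()]

def closure(P, D):
    """Intersection of all transactions containing P; None if no transaction does."""
    covering = [t for t in D if set(P) <= t]
    if not covering:
        return None
    return set.intersection(*covering)

def closure_extension(C, item, D):
    """Add one item to a closed itemset C and take the closure of the result."""
    return closure(set(C) | {item}, D)

print(sorted(closure("b", D)))                  # ['b', 'c']
print(sorted(closure("de", D)))                 # ['a', 'd', 'e']
print(sorted(closure_extension("c", "b", D)))   # ['b', 'c']
print(sorted(closure_extension("a", "d", D)))   # ['a', 'd']
```

An itemset is closed exactly when it equals its own closure, which gives a second way to see that b and de are not closed on this data.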
47 LCM: Lemma 1
Let $X$ and $X'$ be two itemsets. $X'$ is a child of $X$ $\Leftrightarrow$
$freq(X') > 0$ (condition 1)
$X$ is a prefix of $X'$ (condition 2)
$X' = \bigcap_{i \in S(X')} t_i$ (condition 3)
48 Example (8)
$X'$ is a child of $X$ $\Leftrightarrow$ $freq(X') > 0$, $X$ is a prefix of $X'$, and $X' = \bigcap_{i \in S(X')} t_i$
(Figure: itemset lattice annotated with frequencies.)
49 LCM: Algorithm [Uno et al., 03]
(Figure: algorithm listing, annotated with $tail(X)$ and $\bigcap_{j \in S(X[i])} t_j$.)
50 Diffsets [Zaki and Hsiao, 02]
Given an itemset $X$ and its extension $X[i]$:
$S(X[i]) = S(X \cup \{i\})$
$diffset(X, i) = S(X) \setminus S(X[i])$
$freq(X[i]) = |S(X)| - |diffset(X, i)|$
Example (9): let $S(X) = \{1, 4, 6, 7, 9\}$ and $S(X[i]) = \{1, 6, 9\}$. Then $diffset(X, i) = \{4, 7\}$ and $freq(X[i]) = 5 - 2 = 3$.
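The arithmetic of Example (9) can be sketched in Python. The cover of the single item $i$ is not given in the example, so the set `S_i` below is a hypothetical value chosen so that $S(X) \cap S(\{i\}) = \{1, 6, 9\}$; the function names are illustrative.

```python
def diffset(S_X, S_i):
    """Transactions that contain X but not the item i, i.e. S(X) minus S(X[i])."""
    return S_X - S_i

def freq_extension(S_X, S_i):
    """freq(X[i]) = |S(X)| - |diffset(X, i)|, without building S(X[i])."""
    return len(S_X) - len(diffset(S_X, S_i))

S_X = {1, 4, 6, 7, 9}   # cover of X, from Example (9)
S_i = {1, 2, 6, 9}      # hypothetical cover of item i alone (not given in the example)

print(sorted(diffset(S_X, S_i)))   # [4, 7]
print(freq_extension(S_X, S_i))    # 3
```

Subtracting $S(\{i\})$ gives the same set as subtracting $S(X[i])$, because $S(X[i]) = S(X) \cap S(\{i\})$. Storing the diffset pays off when it is smaller than the cover it replaces, as here with 2 identifiers instead of 3.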
51 Some results (1/2)
52 Some results (2/2)