CSE4334/5334 Data Mining: Association Rule Mining. Chengkai Li, University of Texas at Arlington, Fall 2017
Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
Market-basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Example of association rules: {Diaper} -> {Beer}, {Milk, Bread} -> {Eggs, Coke}, {Beer, Bread} -> {Milk}, ...
Implication means co-occurrence, not causality!
Definition: Frequent Itemset
- Itemset: a collection of one or more items. Example: {Milk, Bread, Diaper}
- k-itemset: an itemset that contains k items
- Support count (σ): frequency of occurrence of an itemset. E.g. σ({Milk, Bread, Diaper}) = 2
- Support (s): fraction of transactions that contain an itemset. E.g. s({Milk, Bread, Diaper}) = 2/5
- Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule
- Association rule: an implication expression of the form X -> Y, where X and Y are itemsets. Example: {Milk, Diaper} -> {Beer}
- Rule evaluation metrics:
  - Support (s): fraction of transactions that contain both X and Y
  - Confidence (c): measures how often items in Y appear in transactions that contain X
Example, for {Milk, Diaper} -> {Beer}:
  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
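The two metrics above can be computed directly from the slide's five market-basket transactions. A minimal sketch (not library code); the function names `support` and `confidence` are illustrative choices:

```python
# Support and confidence of a rule X -> Y over the slide's transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """support(X union Y) / support(X)."""
    return support(X | Y, transactions) / support(X, transactions)

s = support({"Milk", "Diaper", "Beer"}, transactions)       # 2/5 = 0.4
c = confidence({"Milk", "Diaper"}, {"Beer"}, transactions)  # 2/3 ≈ 0.67
```

The `<=` operator on Python sets is the subset test, which matches the "transaction contains itemset" condition exactly.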
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
- support >= minsup threshold
- confidence >= minconf threshold
Brute-force approach:
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf thresholds
=> Computationally prohibitive!
Mining Association Rules
Example of rules (over the same transactions):
{Milk, Diaper} -> {Beer} (s=0.4, c=0.67)
{Milk, Beer} -> {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} -> {Milk} (s=0.4, c=0.67)
{Beer} -> {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} -> {Milk, Beer} (s=0.4, c=0.5)
{Milk} -> {Diaper, Beer} (s=0.4, c=0.5)
Observations:
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements
Mining Association Rules
Two-step approach:
1. Frequent itemset generation: generate all itemsets whose support >= minsup
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is still computationally expensive.
Frequent Itemset Generation
(Figure: itemset lattice over five items A-E, from the null itemset through A, B, ..., AB, AC, ..., down to ABCDE.)
Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation
Brute-force approach:
- Each itemset in the lattice is a candidate frequent itemset
- Count the support of each candidate by scanning the database: match each of the N transactions (of average width w) against every one of the M candidates
Complexity ~ O(NMw) => expensive, since M = 2^d!
Computational Complexity
Given d unique items:
- Total number of itemsets = 2^d
- Total number of possible association rules:
  R = Σ_{k=1}^{d-1} [ C(d,k) × Σ_{j=1}^{d-k} C(d-k,j) ] = 3^d - 2^(d+1) + 1
If d = 6, R = 602 rules.
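The double-sum counts every way to pick a nonempty LHS of k items and a nonempty RHS from the remaining d-k items. A quick sketch verifying it against the closed form:

```python
from math import comb

def num_rules(d):
    """Total association rules over d items:
    choose k items for the LHS, then any nonempty subset of the rest for the RHS."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

# Closed form from the slide: 3^d - 2^(d+1) + 1
assert num_rules(6) == 3**6 - 2**7 + 1 == 602
```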
Frequent Itemset Generation Strategies
- Reduce the number of candidates (M): complete search has M = 2^d; use pruning techniques to reduce M
- Reduce the number of transactions (N): reduce the size of N as the size of the itemset increases; used by DHP and vertical-based mining algorithms
- Reduce the number of comparisons (NM): use efficient data structures to store the candidates or transactions; no need to match every candidate against every transaction
Reducing Number of Candidates
Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
The Apriori principle holds due to the following property of the support measure:
  ∀ X, Y: (X ⊆ Y) => s(X) >= s(Y)
- Support of an itemset never exceeds the support of its subsets
- This is known as the anti-monotone property of support
Illustrating the Apriori Principle
(Figure: the itemset lattice over A-E; once an itemset such as AB is found to be infrequent, all of its supersets — ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE — are pruned without counting.)
Illustrating the Apriori Principle
Minimum support count = 3
Items (1-itemsets): Bread 4, Coke 2, Milk 4, Beer 3, Diaper 4, Eggs 1
Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:
{Bread, Milk} 3, {Bread, Beer} 2, {Bread, Diaper} 3, {Milk, Beer} 2, {Milk, Diaper} 3, {Beer, Diaper} 3
Triplets (3-itemsets): {Bread, Milk, Diaper} 3
If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13.
Apriori Algorithm
Method:
- Let k = 1
- Generate frequent itemsets of length 1
- Repeat until no new frequent itemsets are identified:
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  - Prune candidate itemsets containing subsets of length k that are infrequent
  - Count the support of each candidate by scanning the DB
  - Eliminate candidates that are infrequent, leaving only those that are frequent
Another Example
Database TDB (sup_min = 2):
Tid  Items
1    A, B, D
2    A, C, E
3    A, B, C, E
4    C, E
1st scan — C1: {A} 3, {B} 2, {C} 3, {D} 1, {E} 3; L1: {A} 3, {B} 2, {C} 3, {E} 3
2nd scan — C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}; counts: {A,B} 2, {A,C} 2, {A,E} 2, {B,C} 1, {B,E} 1, {C,E} 3; L2: {A,B} 2, {A,C} 2, {A,E} 2, {C,E} 3
3rd scan — C3: {A,C,E}; L3: {A,C,E} 2
Apriori Algorithm
Let k = 1. Generate frequent itemsets of length 1. Repeat until no new frequent itemsets are identified:
1. Generate candidate (k+1)-itemsets from frequent k-itemsets
2. Prune candidate (k+1)-itemsets containing some infrequent k-itemset
3. Count the support of each candidate by scanning the DB
4. Eliminate infrequent candidates, leaving only those that are frequent
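The four-step loop above can be sketched end to end in a few lines. This is a teaching sketch under the slide's assumptions (support counts rather than fractions; the `apriori` name and `frozenset` representation are my choices), not an optimized implementation:

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Plain Apriori: return {frequent itemset (frozenset): support count}."""
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, n in counts.items() if n >= minsup_count}
    frequent = {s: counts[s] for s in Lk}
    k = 1
    while Lk:
        # Step 1: candidate generation by self-join of Lk with itself
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Step 2: prune candidates with an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(sub) in Lk for sub in combinations(c, k))}
        # Step 3: count support by scanning the DB
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        # Step 4: eliminate infrequent candidates
        Lk = {c for c, n in counts.items() if n >= minsup_count}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent

# The TDB example from the slides, sup_min = 2
db = [{"A","B","D"}, {"A","C","E"}, {"A","B","C","E"}, {"C","E"}]
freq = apriori([frozenset(t) for t in db], minsup_count=2)
```

On the TDB example this reproduces L1, L2, and L3 = {A,C,E} with support count 2, matching the three scans on the previous slide.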
1. Generate Candidate (k+1)-itemsets (sup_min = 2)
Input: frequent k-itemsets L_k. Output: frequent (k+1)-itemsets L_{k+1}.
Procedure:
1. Candidate generation, by self-join L_k * L_k: for each pair p = {p1, p2, ..., pk} ∈ L_k, q = {q1, q2, ..., qk} ∈ L_k, if p1 = q1, ..., p(k-1) = q(k-1), and pk < qk, add {p1, ..., p(k-1), pk, qk} into C_{k+1}
Example: L2 = {AB, AC, AE, CE}
- AB and AC => ABC
- AB and AE => ABE
- AC and AE => ACE
So C3 = {ABC, ABE, ACE}.
2. Prune Candidates (sup_min = 2)
After self-joining L2 = {AB, AC, AE, CE} into C3 = {ABC, ABE, ACE}:
2. Prune candidates that contain infrequent k-itemsets
- ABC: pruned, because BC is not frequent
- ABE: pruned, because BE is not frequent
- ACE: kept
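The self-join condition (shared k-1 prefix, smaller last item first) and the subset-pruning step can be written directly on sorted tuples. A minimal sketch; `self_join` and `prune` are illustrative names:

```python
def self_join(Lk):
    """Merge two sorted k-itemsets that agree on their first k-1 items."""
    Lk = sorted(Lk)
    out = []
    for i, p in enumerate(Lk):
        for q in Lk[i + 1:]:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                out.append(p + (q[-1],))
    return out

def prune(candidates, Lk):
    """Drop candidates that have an infrequent k-subset (Apriori principle)."""
    Lk = set(Lk)
    return [c for c in candidates
            if all(c[:i] + c[i + 1:] in Lk for i in range(len(c)))]

L2 = [("A", "B"), ("A", "C"), ("A", "E"), ("C", "E")]
C3 = self_join(L2)   # [("A","B","C"), ("A","B","E"), ("A","C","E")]
C3 = prune(C3, L2)   # [("A","C","E")] -- ABC and ABE dropped (BC, BE infrequent)
```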
3. Count Support of Candidates and 4. Eliminate Infrequent Candidates (sup_min = 2)
After candidate generation and pruning, C3 = {A,C,E}.
3. Count the support of each candidate by scanning the DB (3rd scan)
4. Eliminate infrequent candidates
Result: L3 = {A,C,E} with support count 2.
Reducing Number of Comparisons
Candidate counting:
- Scan the database of N transactions to determine the support of each candidate itemset
- To reduce the number of comparisons, store the candidates in a hash structure of k buckets: instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets
Generate Hash Tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
- A hash function: here, items 1,4,7 / 2,5,8 / 3,6,9 hash to the three branches
- A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)
(Figure: the resulting hash tree over the 15 candidates.)
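The construction can be sketched with nested dicts: an internal node hashes on the next item position, a leaf is a plain list. This is a simplified sketch of the slide's structure (the `MAX_LEAF`, `h`, and `build` names are mine, and real hash trees interleave building with counting):

```python
# Simplified hash tree over the 15 length-3 candidates from the slide.
# h(item) = (item - 1) % 3 sends {1,4,7} -> 0, {2,5,8} -> 1, {3,6,9} -> 2.
MAX_LEAF = 3

def h(item):
    return (item - 1) % 3

def build(itemsets, depth=0):
    """Leaf (list) if small enough or out of positions; else split into 3 branches."""
    if len(itemsets) <= MAX_LEAF or depth == len(itemsets[0]):
        return itemsets
    buckets = {0: [], 1: [], 2: []}
    for s in itemsets:
        buckets[h(s[depth])].append(s)
    return {b: build(v, depth + 1) for b, v in buckets.items() if v}

candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6),
              (2,3,4), (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]
tree = build(candidates)
```

With max leaf size 3, bucket 0 at the root (first item in {1,4,7}) holds 7 candidates and is split further, matching the figure.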
Association Rule Discovery: Hash Tree
(Figure: the candidate hash tree; the hash function sends 1,4,7 / 2,5,8 / 3,6,9 to the three branches. The subtree reached by hashing on 1, 4 or 7 is highlighted.)
Association Rule Discovery: Hash Tree
(Same tree; the subtree reached by hashing on 2, 5 or 8 is highlighted.)
Association Rule Discovery: Hash Tree
(Same tree; the subtree reached by hashing on 3, 6 or 9 is highlighted.)
Subset Operation
Given a transaction t = {1, 2, 3, 5, 6}, what are the possible subsets of size 3?
(Enumeration tree: level 1 fixes the first item at 1, 2, or 3; level 2 fixes the second item; level 3 yields the size-3 subsets:)
1 2 3, 1 2 5, 1 2 6, 1 3 5, 1 3 6, 1 5 6, 2 3 5, 2 3 6, 2 5 6, 3 5 6
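The enumeration tree produces exactly the C(5,3) = 10 combinations in lexicographic order, which `itertools.combinations` reproduces:

```python
from itertools import combinations

t = (1, 2, 3, 5, 6)
subsets3 = list(combinations(t, 3))
# C(5,3) = 10 size-3 subsets, in the same order as the slide's level-3 row
```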
Subset Operation Using Hash Tree
(Figure: transaction {1 2 3 5 6} is split at the root as 1 + {2 3 5 6}, 2 + {3 5 6}, 3 + {5 6}, and each part is hashed down the candidate hash tree.)
Subset Operation Using Hash Tree
(Figure: at the next level, 1 + {2 3 5 6} expands to 1 2 + {3 5 6}, 1 3 + {5 6}, 1 5 + {6}, and similarly for the other branches.)
Subset Operation Using Hash Tree
(Figure: descending the tree visits only the leaves that can contain subsets of the transaction.)
Match the transaction against 11 out of 15 candidates.
Factors Affecting Complexity
- Choice of minimum support threshold: lowering the support threshold results in more frequent itemsets; this may increase the number of candidates and the max length of frequent itemsets
- Dimensionality (number of items) of the data set: more space is needed to store the support count of each item; if the number of frequent items also increases, both computation and I/O costs may increase
- Size of database: since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions
- Average transaction width: transaction width increases with denser data sets; this may increase the max length of frequent itemsets and traversals of the hash tree (the number of subsets in a transaction increases with its width)
Rule Generation
Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f -> L − f satisfies the minimum confidence requirement.
If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC -> D, ABD -> C, ACD -> B, BCD -> A, A -> BCD, B -> ACD, C -> ABD, D -> ABC, AB -> CD, AC -> BD, AD -> BC, BC -> AD, BD -> AC, CD -> AB
If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L -> ∅ and ∅ -> L).
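Enumerating the candidate rules is just enumerating the proper non-empty subsets of L as antecedents. A minimal sketch (`candidate_rules` is an illustrative name):

```python
from itertools import combinations

def candidate_rules(L):
    """All non-empty f ⊂ L paired as (f, L - f): 2^|L| - 2 candidate rules."""
    L = frozenset(L)
    rules = []
    for r in range(1, len(L)):                  # LHS sizes 1 .. k-1
        for lhs in combinations(sorted(L), r):
            lhs = frozenset(lhs)
            rules.append((lhs, L - lhs))
    return rules

rules = candidate_rules({"A", "B", "C", "D"})
assert len(rules) == 2**4 - 2                   # 14 rules, as on the slide
```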
Rule Generation
How to efficiently generate rules from frequent itemsets?
- In general, confidence does not have an anti-monotone property: c(ABC -> D) can be larger or smaller than c(AB -> D)
- But confidence of rules generated from the same itemset has an anti-monotone property. E.g., for L = {A,B,C,D}: c(ABC -> D) >= c(AB -> CD) >= c(A -> BCD)
- Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
Rule Generation for Apriori Algorithm
(Figure: lattice of rules for {A,B,C,D}, from ABCD => {} at the top, through BCD => A, ACD => B, ABD => C, ABC => D and CD => AB, BD => AC, BC => AD, AD => BC, AC => BD, AB => CD, down to D => ABC, C => ABD, B => ACD, A => BCD. Once a rule is found to have low confidence, every rule below it in the lattice — obtained by moving more items from the LHS to the RHS — is pruned.)
Rule Generation for Apriori Algorithm
A candidate rule is generated by merging two rules that share the same prefix in the rule consequent:
- join(CD => AB, BD => AC) would produce the candidate rule D => ABC
- Prune rule D => ABC if its subset rule AD => BC does not have high confidence
Pattern Evaluation
- Association rule algorithms tend to produce too many rules; many of them are uninteresting or redundant (redundant if {A,B,C} -> {D} and {A,B} -> {D} have the same support and confidence)
- Interestingness measures can be used to prune/rank the derived patterns
- In the original formulation of association rules, support and confidence are the only measures used
Computing Interestingness Measure
Given a rule X -> Y, the information needed to compute rule interestingness can be obtained from a contingency table.
Contingency table for X -> Y:
         Y      Y̅
X        f11    f10    f1+
X̅        f01    f00    f0+
         f+1    f+0    |T|
f11: support of X and Y; f10: support of X and Y̅; f01: support of X̅ and Y; f00: support of X̅ and Y̅.
The table is used to define various measures: support, confidence, lift, Gini, J-measure, etc.
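Several of the measures discussed on the following slides fall out of the four counts directly. A sketch, illustrated on the Tea/Coffee table below (the dict-of-measures shape is my choice):

```python
from math import sqrt

def measures(f11, f10, f01, f00):
    """Interestingness measures of X -> Y from a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    pxy = f11 / n                 # P(X,Y)
    px = (f11 + f10) / n          # P(X)
    py = (f11 + f01) / n          # P(Y)
    return {
        "support":    pxy,
        "confidence": pxy / px,                      # P(Y|X)
        "lift":       (pxy / px) / py,               # P(Y|X) / P(Y)
        "PS":         pxy - px * py,                 # Piatetsky-Shapiro
        "phi":        (pxy - px * py)
                      / sqrt(px * (1 - px) * py * (1 - py)),
    }

m = measures(f11=15, f10=5, f01=75, f00=5)   # Tea -> Coffee
# confidence = 0.75, but lift = 0.75 / 0.9 < 1: negatively associated
```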
Drawback of Confidence
        Coffee   Coffee̅
Tea     15       5       20
Tea̅     75       5       80
        90       10      100
Association rule: Tea -> Coffee
Confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 0.9
=> Although confidence is high, the rule is misleading: P(Coffee | Tea̅) = 75/80 = 0.9375
Statistical Independence
Population of 1000 students:
- 600 students know how to swim (S)
- 700 students know how to bike (B)
- 420 students know how to swim and bike (S, B)
P(S∧B) = 420/1000 = 0.42; P(S) × P(B) = 0.6 × 0.7 = 0.42
- P(S∧B) = P(S) × P(B) => statistical independence
- P(S∧B) > P(S) × P(B) => positively correlated
- P(S∧B) < P(S) × P(B) => negatively correlated
Statistical-based Measures
Measures that take into account statistical dependence:
  Lift = P(Y|X) / P(Y)
  Interest = P(X,Y) / (P(X) P(Y))
  PS = P(X,Y) − P(X) P(Y)
  φ-coefficient = (P(X,Y) − P(X) P(Y)) / sqrt( P(X)[1 − P(X)] P(Y)[1 − P(Y)] )
Example: Lift/Interest
(Same Tea/Coffee contingency table: Tea row 15, 5, 20; Tea̅ row 75, 5, 80; totals 90, 10, 100.)
Association rule: Tea -> Coffee
Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9
=> Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
Drawback of Lift & Interest
        Y     Y̅                      Y     Y̅
X       10    0     10        X      90    0     90
X̅       0     90    90        X̅      0     10    10
        10    90    100              90    10    100
Lift = 0.1 / ((0.1)(0.1)) = 10       Lift = 0.9 / ((0.9)(0.9)) = 1.11
Statistical independence: if P(X,Y) = P(X) P(Y), then Lift = 1.
There are lots of measures proposed in the literature. Some measures are good for certain applications, but not for others. What criteria should we use to determine whether a measure is good or bad? What about Apriori-style support-based pruning? How does it affect these measures?
Properties of a Good Measure
Piatetsky-Shapiro: three properties a good measure M must satisfy:
1. M(A,B) = 0 if A and B are statistically independent
2. M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged
3. M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged
Comparing Different Measures
Examples of contingency tables, whose rankings differ under different measures:
Example  f11    f10    f01    f00
E1       8123   83     424    1370
E2       8330   2      622    1046
E3       9481   94     127    298
E4       3954   3080   5      2961
E5       2886   1363   1320   4431
E6       1500   2000   500    6000
E7       4000   2000   1000   3000
E8       4000   2000   2000   2000
E9       1720   7121   5      1154
E10      61     2483   4      7452
Property under Variable Permutation
      B    B̅               A    A̅
A     p    q          B    p    r
A̅     r    s          B̅    q    s
Does M(A,B) = M(B,A)?
- Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc.
- Asymmetric measures: confidence, conviction, Laplace, J-measure, etc.
Property under Row/Column Scaling
Grade-Gender example (Mosteller, 1968):
       Male  Female              Male  Female
High   2     3       5    High   4     30      34
Low    1     4       5    Low    2     40      42
       3     7       10          6     70      76
(second table: male column scaled ×2, female column scaled ×10)
Mosteller: the underlying association should be independent of the relative number of male and female students in the samples.
Property under Inversion Operation
(Figure: transaction bit vectors A-F over transactions 1..N, in pairs (a), (b), (c); the vectors in (c) are the inversions — 0s and 1s flipped — of those in (a).)
Example: φ-Coefficient
The φ-coefficient is analogous to the correlation coefficient for continuous variables.
      Y    Y̅               Y    Y̅
X     60   10   70    X    20   10   30
X̅     10   20   30    X̅    10   60   70
      70   30   100        30   70   100
φ = (0.6 − 0.7 × 0.7) / sqrt(0.7 × 0.3 × 0.7 × 0.3) = 0.5238
φ = (0.2 − 0.3 × 0.3) / sqrt(0.3 × 0.7 × 0.3 × 0.7) = 0.5238
The φ-coefficient is the same for both tables.
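Checking the two tables numerically confirms that φ cannot tell them apart, even though the first emphasizes co-presence and the second co-absence. A small sketch (the `phi` function name is mine):

```python
from math import sqrt

def phi(f11, f10, f01, f00):
    """phi-coefficient of a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    pxy = f11 / n
    px = (f11 + f10) / n
    py = (f11 + f01) / n
    return (pxy - px * py) / sqrt(px * (1 - px) * py * (1 - py))

a = phi(60, 10, 10, 20)   # first table on the slide
b = phi(20, 10, 10, 60)   # second table (0s and 1s inverted)
# both evaluate to (0.11 / 0.21) ≈ 0.5238
```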
Property under Null Addition
      B    B̅               B    B̅
A     p    q          A    p    q
A̅     r    s          A̅    r    s + k
- Invariant measures: support, cosine, Jaccard, etc.
- Non-invariant measures: correlation, Gini, mutual information, odds ratio, etc.
Different Measures Have Different Properties
Symbol  Measure              Range                  P1    P2   P3   O1    O2   O3    O3'  O4
φ       Correlation          −1 … 0 … 1             Yes   Yes  Yes  Yes   No   Yes   Yes  No
λ       Lambda               0 … 1                  Yes   No   No   Yes   No   No*   Yes  No
α       Odds ratio           0 … 1 … ∞              Yes*  Yes  Yes  Yes   Yes  Yes*  Yes  No
Q       Yule's Q             −1 … 0 … 1             Yes   Yes  Yes  Yes   Yes  Yes   Yes  No
Y       Yule's Y             −1 … 0 … 1             Yes   Yes  Yes  Yes   Yes  Yes   Yes  No
κ       Cohen's              −1 … 0 … 1             Yes   Yes  Yes  Yes   No   No    Yes  No
M       Mutual information   0 … 1                  Yes   Yes  Yes  Yes   No   No*   Yes  No
J       J-measure            0 … 1                  Yes   No   No   No    No   No    No   No
G       Gini index           0 … 1                  Yes   No   No   No    No   No*   Yes  No
s       Support              0 … 1                  No    Yes  No   Yes   No   No    No   No
c       Confidence           0 … 1                  No    Yes  No   Yes   No   No    No   Yes
L       Laplace              0 … 1                  No    Yes  No   Yes   No   No    No   No
V       Conviction           0.5 … 1 … ∞            No    Yes  No   Yes** No   No    Yes  No
I       Interest             0 … 1 … ∞              Yes*  Yes  Yes  Yes   No   No    No   No
IS      IS (cosine)          0 … sqrt(P(A,B)) … 1   No    Yes  Yes  Yes   No   No    No   Yes
PS      Piatetsky-Shapiro's  −0.25 … 0 … 0.25       Yes   Yes  Yes  Yes   No   Yes   Yes  No
F       Certainty factor     −1 … 0 … 1             Yes   Yes  Yes  No    No   No    Yes  No
AV      Added value          −0.5 … 1               Yes   Yes  Yes  No    No   No    No   No
S       Collective strength  0 … 1 … ∞              No    Yes  Yes  Yes   No   Yes*  Yes  No
ζ       Jaccard              0 … 1                  No    Yes  Yes  Yes   No   No    No   Yes
K       Klosgen's            (2/√3 − 1)^(1/2) (2 − √3 − 1/√3) … 0 … 2/(3√3)   Yes  Yes  Yes  No  No  No  No  No