Chapter 6: Miig Frequet Patters, Associatio ad Correlatios Basic cocepts Frequet itemset miig methods Costrait-based frequet patter miig (ch7) Associatio rules 1
What Is Frequet Patter Aalysis? Frequet patter: a patter (a set of items, subsequeces, substructures, etc.) that occurs frequetly i a data set First proposed by Agrawal, Imieliski, ad Swami [AIS93] i the cotext of frequet itemsets ad associatio rule miig Motivatio: Fidig iheret regularities i data What products were ofte purchased together? Beer ad diapers?! What are the subsequet purchases after buyig a PC? What kids of DNA are sesitive to this ew drug? Ca we automatically classify web documets? Applicatios Basket data aalysis, cross-marketig, catalog desig, sale campaig aalysis, Web log (click stream) aalysis, ad DNA sequece aalysis. 2
Why Is Freq. Patter Miig Importat? Freq. patter: itrisic ad importat property of data sets Foudatio for may essetial data miig tasks Associatio, correlatio, ad causality aalysis Sequetial, structural (e.g., sub-graph) patters Patter aalysis i spatiotemporal, multimedia, timeseries, ad stream data Classificatio: associative classificatio Cluster aalysis: frequet patter-based clusterig Data warehousig: iceberg cube ad cube-gradiet Sematic data compressio: fascicles Broad applicatios 3
Basic Cocepts: Frequet Patters Tid Customer buys beer Items bought 10 Beer, Nuts, Diaper 20 Beer, Coffee, Diaper 30 Beer, Diaper, Eggs 40 Nuts, Eggs, Milk 50 Nuts, Coffee, Diaper, Eggs, Milk Customer buys both Customer buys diaper itemset: A set of items k-itemset X = {x 1,, x k } (absolute) support, or, support cout of X: Frequecy or occurrece of a itemset X (relative) support, s, is the fractio of trasactios that cotais X (i.e., the probability that a trasactio cotais X) A itemset X is frequet if X s support is o less tha a misup threshold 4
Closed Patters ad Max-Patters A log patter cotais a combiatorial umber of sub-patters, e.g., {a 1,, a 100 } cotais 2 100 1 = 1.27*10 30 sub-patters! Solutio: Mie closed patters ad max-patters istead A itemset X is a closed patter if X is frequet ad there exist o super-patters with the same support all super-patters must have smaller support A itemset X is a max-patter if X is frequet ad there exist o super-patters that are frequet Relatioship betwee the two? Closed patters are a lossless compressio of freq. patters, whereas max-patters are a lossy compressio Lossless: ca derive all frequet patters as well as their support Lossy: ca derive all frequet patters 5
Closed Patters ad Max-Patters DB = {<a 1,, a 100 >, < a 1,, a 50 >} mi_sup = 1 What is the set of closed patters? <a 1,, a 100 >: 1 < a 1,, a 50 >: 2 How to derive frequet patters ad their support values? What is the set of max-patters? <a 1,, a 100 >: 1 How to derive frequet patters? What is the set of all patters? {a 1 }: 2,, {a 1, a 2 }: 2,, {a 1, a 51 }: 1,, {a 1, a 2,, a 100 }: 1 A big umber: 2 100 1 6
Closed Patters ad Max-Patters For a give dataset with itemset I = {a,b,c,d} ad mi_sup = 8, the closed patters are {a,b,c,d} with support of 10, {a,b,c} with support of 12, ad {a, b,d} with support of 14. Derive the frequet 2- itemsets together with their support values {a,b}: 14 {a,c}: 12 {a,d}: 14 {b,c}: 12 {b,d}: 14 {c,d}: 10 7
Chapter 6: Miig Frequet Patters, Associatio ad Correlatios Basic cocepts Frequet itemset miig methods Costrait-based frequet patter miig (ch7) Associatio rules 8
Scalable Frequet Itemset Miig Methods Apriori: A Cadidate Geeratio-ad-Test Approach Improvig the Efficiecy of Apriori FPGrowth: A Frequet Patter-Growth Approach ECLAT: Frequet Patter Miig with Vertical Data Format 9
Scalable Methods for Miig Frequet Patters The dowward closure (ati-mootoic) property of frequet patters Ay subset of a frequet itemset must be frequet If {beer, diaper, uts} is frequet, so is {beer, diaper} i.e., every trasactio havig {beer, diaper, uts} also cotais {beer, diaper} Scalable miig methods: Three major approaches Apriori (Agrawal & Srikat@VLDB 94) Freq. patter growth (Fpgrowth: Ha, Pei & Yi @SIGMOD 00) Vertical data format (Charm Zaki & Hsiao @SDM 02) 10
Apriori: A Cadidate Geeratio-ad-Test Approach Apriori pruig priciple: If there is ay itemset that is ifrequet, its superset should ot be geerated/tested! (Agrawal & Srikat @VLDB 94, Maila, et al. @ KDD 94) Method: Iitially, sca DB oce to get frequet 1-itemset Geerate legth (k+1) cadidate itemsets from legth k frequet itemsets Test the cadidates agaist DB Termiate whe o frequet or cadidate set ca be geerated 11
The Apriori Algorithm A Example Tid DB Items 10 a, c, d 20 b, c, e 30 a, b, c, e 40 b, e 1 st sca Itemset sup {a} 2 L C 1 1 {b} 3 {c} 3 {d} 1 {e} 3 C 2 C 2 {a, b} 1 L 2 Itemset sup 2 d sca {a, c} 2 {b, c} 2 {b, e} 3 {c, e} 2 Itemset sup {a, c} 2 {a, e} 1 {b, c} 2 {b, e} 3 {c, e} 2 Itemset sup {a} 2 {b} 3 {c} 3 {e} 3 mi_sup= 2 Itemset {a, b} {a, c} {a, e} {b, c} {b, e} {c, e} C Itemset 3 3 rd sca L 3 {b, c, e} Itemset sup {b, c, e} 2 12
The Apriori Algorithm (Pseudo-code) C k : Cadidate itemset of size k L k : frequet itemset of size k L 1 = {frequet items}; for (k = 1; L k!= ; k++) do begi C k+1 = cadidates geerated from L k ; for each trasactio t i database do icremet the cout of all cadidates i C k+1 that are cotaied i t L k+1 = cadidates i C k+1 with mi_support ed retur k L k ; 13
Implemetatio of Apriori Geerate cadidates, the cout support for the geerated cadidates How to geerate cadidates? Step 1: self-joiig L k Step 2: pruig Example: L 3 ={abc, abd, acd, ace, bcd} Self-joiig: L 3 *L 3 abcd from abc ad abd acde from acd ad ace Pruig: acde is removed because ade is ot i L 3 C 4 ={abcd} The above procedures do ot miss ay legitimate cadidates. Thus Apriori mies a complete set of frequet patters. 14
How to Cout Supports of Cadidates? Why coutig supports of cadidates a problem? The total umber of cadidates ca be very huge Oe trasactio may cotai may cadidates Method: Cadidate itemsets are stored i a hash-tree Leaf ode of hash-tree cotais a list of itemsets ad couts Iterior ode cotais a hash table Subset fuctio: fids all the cadidates cotaied i a trasactio 15
Example: Coutig Supports of Cadidates Subset fuctio 3,6,9 1,4,7 2,5,8 Trasactio: 1 2 3 5 6 1 + 2 3 5 6 1 3 + 5 6 1 2 + 3 5 6 1 4 5 1 2 4 4 5 7 1 2 5 4 5 8 2 3 4 5 6 7 1 3 6 1 5 9 3 4 5 3 5 6 3 5 7 6 8 9 3 6 7 3 6 8 16
Further Improvemet of the Apriori Method Major computatioal challeges Multiple scas of trasactio database Huge umber of cadidates Tedious workload of support coutig for cadidates Improvig Apriori: geeral ideas Reduce passes of trasactio database scas Shrik umber of cadidates Facilitate support coutig of cadidates 17
Apriori applicatios beyod freq. patter miig Give a set S of studets, we wat to fid each subset of S such that the age rage of the subset is less tha 5. Apriori algorithm, level-wise search usig the dowward closure property for pruig to gai efficiecy Ca be used to search for ay subsets with the dowward closure property (i.e., ati-mootoe costrait) CLIQUE for subspace clusterig used the same Apriori priciple, where the oe-dimesioal cells are the items 18
Chapter 6: Miig Frequet Patters, Associatio ad Correlatios Basic cocepts Frequet itemset miig methods Costrait-based frequet patter miig (ch7) Associatio rules 19
Costrait-based (Query-Directed) Miig Fidig all the patters i a database autoomously? urealistic! The patters could be too may but ot focused! Data miig should be a iteractive process User directs what to be mied usig a data miig query laguage (or a graphical user iterface) Costrait-based miig User flexibility: provides costraits o what to be mied Optimizatio: explores such costraits for efficiet miig costrait-based miig: costrait-pushig, similar to push selectio first i DB query processig Note: still fid all the aswers satisfyig costraits, ot fidig some aswers i heuristic search 20
Costraied Miig vs. Costrait-Based Search Costraied miig vs. costrait-based search/reasoig Both are aimed at reducig search space Fidig all patters satisfyig costraits vs. fidig some (or oe) aswer i costrait-based search i AI Costrait-pushig vs. heuristic search It is a iterestig research problem o how to itegrate them Costraied miig vs. query processig i DBMS Database query processig requires to fid all Costraied patter miig shares a similar philosophy as pushig selectios deeply i query processig
Costrait-Based Frequet Patter Miig Patter space pruig costraits Ati-mootoic: If costrait c is violated, its further miig ca be termiated Mootoic: If c is satisfied, o eed to check c agai Succict: c must be satisfied, so oe ca start with the data sets satisfyig c Covertible: c is ot mootoic or ati-mootoic, but it ca be coverted ito it if items i the trasactio ca be properly ordered Data space pruig costrait Data succict: Data space ca be prued at the iitial patter miig process Data ati-mootoic: If a trasactio t does ot satisfy c, t ca be prued from its further miig 22
Ati-Mootoicity i Costrait Pushig Ati-mootoicity Whe a itemset S violates the costrait, so does ay of its superset sum(s.price) v is ati-mootoic sum(s.price) v is ot ati-mootoic C: rage(s.profit) 15 is ati-mootoic Itemset ab violates C So does every superset of ab support cout >= mi_sup is atimootoic core property used i Apriori TDB (mi_sup=2) TID Trasactio 10 a, b, c, d, f 20 b, c, d, f, g, h 30 a, c, d, e, f 40 c, e, f, g Item Profit a 40 b 0 c -20 d 10 e -30 f 30 g 20 h -10
Mootoicity for Costrait Pushig Mootoicity Whe a itemset S satisfies the costrait, so does ay of its superset sum(s.price) v is mootoic mi(s.price) v is mootoic C: rage(s.profit) 15 Itemset ab satisfies C So does every superset of ab Item Profit a 40 b 0 c -20 d 10 e -30 f 30 g 20 h -10 24
Succictess Give A 1, the set of items satisfyig a succictess costrait C, the ay set S satisfyig C is based o A 1, i.e., S cotais a subset belogig to A 1 Idea: Without lookig at the trasactio database, whether a itemset S satisfies costrait C ca be determied based o the selectio of items If a costrait is succict, we ca directly geerate precisely the sets that satisfy it, eve before support coutig begis. Avoids substatial overhead of geerate-ad-test, i.e., such costrait is pre-coutig pushable mi(s.price) v is succict sum(s.price) v is ot succict
Costrait-Based Miig A Geeral Picture Costrait Atimootoe Mootoe Succict v S o yes yes S V o yes yes S V yes o yes mi(s) v o yes yes mi(s) v yes o yes max(s) v yes o yes max(s) v o yes yes cout(s) v yes o weakly cout(s) v o yes weakly sum(s) v ( a S, a 0 ) yes o o sum(s) v ( a S, a 0 ) o yes o rage(s) v yes o o rage(s) v o yes o avg(s) θ v, θ { =,, } covertible covertible o support(s) ξ yes o o support(s) ξ o yes o 26
Chapter 6: Miig Frequet Patters, Associatio ad Correlatios Basic cocepts Frequet itemset miig methods Costrait-based frequet patter miig (ch7) Associatio rules 27
Basic Cocepts: Associatio Rules A associatio rule is of the form X à Y, where X,Y I, X Y = φ A rule is strog if it satisfies both support ad cofidece thresholds. support(x->y): probability that a trasactio cotais X Y, i.e., support(x->y) = P(X U Y) Ca be estimated by the percetage of trasactios i DB that cotai X Y. Not to be cofused with P(X or Y) cofidece(x->y): coditioal probability that a trasactio havig X also cotais Y, i.e. cofidece(x->y) = P(Y X) cofidece(x->y) = P(Y X) = support(x Y) / support (X) = support_cout(x Y) / support_cout(x) cofidece(x->y) ca be easily derived from the support cout of X ad the support cout of X Y. Thus associatio rule miig ca be reduced to frequet patter miig 28
Basic Cocepts: Associatio rules Tid Items bought 10 Beer, Nuts, Diaper 20 Beer, Coffee, Diaper Let misup = 50%, micof = 50% Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3 30 Beer, Diaper, Eggs 40 Nuts, Eggs, Milk 50 Nuts, Coffee, Diaper, Eggs, Milk Associatio rules: (may more!) Beer à Diaper (60%, 100%) Diaper à Beer (60%, 75%) Customer buys both Customer buys diaper If {a} => {b} is a associatio rule, the {b} => {a} is also a associatio rule? q Same support, differet cofidece Customer buys beer If {a,b} => {c} is a associatio rule, the {b} => {c} is also a associatio rule? If {b} => {c} is a associatio rule the {a,b} => {c} is also a associatio rule? 29
Iterestigess Measure: Correlatios (Lift) play basketball eat cereal [40%, 66.7%] is misleadig The overall % of studets eatig cereal is 75% > 66.7%. play basketball ot eat cereal [20%, 33.3%] is more accurate, although with lower support ad cofidece Support ad cofidece are ot good to idicate correlatios Measure of depedet/correlated evets: lift P( A B) lift = P( A) P( B) 2000 / 5000 lift( B, C) = = 0.89 3000 / 5000*3750 / 5000 Basketball Not basketball Sum (row) Cereal 2000 1750 3750 Not cereal 1000 250 1250 Sum(col.) 3000 2000 5000 1000 / 5000 lift( B, C) = = 1.33 3000 / 5000*1250 / 5000 30