Chapter 6: Mining Frequent Patterns, Association and Correlations

Similar documents
COMP9318: Data Warehousing and Data Mining

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6

FP-growth and PrefixSpan

Chapter 6. Frequent Pattern Mining: Concepts and Apriori. Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING

FP-growth and PrefixSpan

Chapters 6 & 7, Frequent Pattern Mining

Association Rules. Acknowledgements. Some parts of these slides are modified from. n C. Clifton & W. Aref, Purdue University

Unit II Association Rules

CS 412 Intro. to Data Mining

Exercises Advanced Data Mining: Solutions

Association Rule. Lecturer: Dr. Bo Yuan. LOGO

DATA MINING LECTURE 3. Frequent Itemsets Association Rules

Association Rules Information Retrieval and Data Mining. Prof. Matteo Matteucci

CS 484 Data Mining. Association Rule Mining 2

( ) GENERATING FUNCTIONS

732A61/TDDD41 Data Mining - Clustering and Association Analysis

D B M G Data Base and Data Mining Group of Politecnico di Torino

Association Rules. Fundamentals

( ) = p and P( i = b) = q.

Lecture 4 February 16, 2016

D B M G. Association Rules. Fundamentals. Fundamentals. Association rules. Association rule mining. Definitions. Rule quality metrics: example

D B M G. Association Rules. Fundamentals. Fundamentals. Elena Baralis, Silvia Chiusano. Politecnico di Torino 1. Definitions.

Signals & Systems Chapter3

OPTIMAL ALGORITHMS -- SUPPLEMENTAL NOTES

Zeros of Polynomials

Math 475, Problem Set #12: Answers

COMP 5331: Knowledge Discovery and Data Mining

Recurrence Relations

CS 584 Data Mining. Association Rule Mining 2

t distribution [34] : used to test a mean against an hypothesized value (H 0 : µ = µ 0 ) or the difference

Randomized Algorithms I, Spring 2018, Department of Computer Science, University of Helsinki Homework 1: Solutions (Discussed January 25, 2018)

Section 5.1 The Basics of Counting

1 Review of Probability & Statistics

REGRESSION (Physics 1210 Notes, Partial Modified Appendix A)

Frequency Domain Filtering

Data Mining Concepts & Techniques

Permutations & Combinations. Dr Patrick Chan. Multiplication / Addition Principle Inclusion-Exclusion Principle Permutation / Combination

IP Reference guide for integer programming formulations.

Infinite Sequences and Series

Properties and Tests of Zeros of Polynomial Functions

AN EFFICIENT PROCEDURE FOR MINING STATISTICALLY SIGNIFICANT FREQUENT ITEMSETS. Predrag Stanišić and Savo Tomović

CS284A: Representations and Algorithms in Molecular Biology

Topic 5: Basics of Probability

Data Analytics Beyond OLAP. Prof. Yanlei Diao

Course Content. Association Rules Outline. Chapter 6 Objectives. Chapter 6: Mining Association Rules. Dr. Osmar R. Zaïane. University of Alberta 4

Lecture Overview. 2 Permutations and Combinations. n(n 1) (n (k 1)) = n(n 1) (n k + 1) =

Recursive Algorithm for Generating Partitions of an Integer. 1 Preliminary

BIOINF 585: Machine Learning for Systems Biology & Clinical Informatics

NUMERICAL METHODS FOR SOLVING EQUATIONS

On Performance Deviation of Binary Search Tree Searches from the Optimal Search Tree Search Structures

Data Mining and Analysis: Fundamental Concepts and Algorithms

Integer Programming (IP)

Linear Regression Analysis. Analysis of paired data and using a given value of one variable to predict the value of the other

Evaluating Data Minability Through Compression An Experimental Study

Lecture 2: April 3, 2013

Random Variables, Sampling and Estimation

DATA MINING - 1DL360

Math 155 (Lecture 3)

Grouping 2: Spectral and Agglomerative Clustering. CS 510 Lecture #16 April 2 nd, 2014

page Suppose that S 0, 1 1, 2.

Intermediate Math Circles November 4, 2009 Counting II

The picture in figure 1.1 helps us to see that the area represents the distance traveled. Figure 1: Area represents distance travelled

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

Lecture Notes for Analysis Class

Massachusetts Institute of Technology

Response Variable denoted by y it is the variable that is to be predicted measure of the outcome of an experiment also called the dependent variable

Find a formula for the exponential function whose graph is given , 1 2,16 1, 6

Injections, Surjections, and the Pigeonhole Principle

Optimally Sparse SVMs

SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES

Goodness-of-Fit Tests and Categorical Data Analysis (Devore Chapter Fourteen)

Lecture Notes for Chapter 6. Introduction to Data Mining

Chapter If n is odd, the median is the exact middle number If n is even, the median is the average of the two middle numbers

Generating Functions. 1 Operations on generating functions

Analysis of Algorithms. Introduction. Contents

6.3 Testing Series With Positive Terms

3.2 Properties of Division 3.3 Zeros of Polynomials 3.4 Complex and Rational Zeros of Polynomials

Feedback in Iterative Algorithms

As stated by Laplace, Probability is common sense reduced to calculation.

DATA MINING - 1DL360

CSE 191, Class Note 05: Counting Methods Computer Sci & Eng Dept SUNY Buffalo

4.3 Growth Rates of Solutions to Recurrences

Calculus 2 Test File Spring Test #1

CS 270 Algorithms. Oliver Kullmann. Growth of Functions. Divide-and- Conquer Min-Max- Problem. Tutorial. Reading from CLRS for week 2

Revision Topic 1: Number and algebra

CHAPTER 2. Mean This is the usual arithmetic mean or average and is equal to the sum of the measurements divided by number of measurements.

Optimization Methods MIT 2.098/6.255/ Final exam

Review for Test 3 Math 1552, Integral Calculus Sections 8.8,

Notes on iteration and Newton s method. Iteration

Fortgeschrittene Datenstrukturen Vorlesung 11

This is an introductory course in Analysis of Variance and Design of Experiments.

Combinatorially Thinking

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss

Mining Probabilistic Association Rules from Uncertain Databases with Pruning

Sequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

Frequentist Inference

Monte Carlo Integration

EECE 301 Signals & Systems

Transcription:

Chapter 6: Miig Frequet Patters, Associatio ad Correlatios Basic cocepts Frequet itemset miig methods Costrait-based frequet patter miig (ch7) Associatio rules 1

What Is Frequet Patter Aalysis? Frequet patter: a patter (a set of items, subsequeces, substructures, etc.) that occurs frequetly i a data set First proposed by Agrawal, Imieliski, ad Swami [AIS93] i the cotext of frequet itemsets ad associatio rule miig Motivatio: Fidig iheret regularities i data What products were ofte purchased together? Beer ad diapers?! What are the subsequet purchases after buyig a PC? What kids of DNA are sesitive to this ew drug? Ca we automatically classify web documets? Applicatios Basket data aalysis, cross-marketig, catalog desig, sale campaig aalysis, Web log (click stream) aalysis, ad DNA sequece aalysis. 2

Why Is Freq. Patter Miig Importat? Freq. patter: itrisic ad importat property of data sets Foudatio for may essetial data miig tasks Associatio, correlatio, ad causality aalysis Sequetial, structural (e.g., sub-graph) patters Patter aalysis i spatiotemporal, multimedia, timeseries, ad stream data Classificatio: associative classificatio Cluster aalysis: frequet patter-based clusterig Data warehousig: iceberg cube ad cube-gradiet Sematic data compressio: fascicles Broad applicatios 3

Basic Cocepts: Frequet Patters Tid Customer buys beer Items bought 10 Beer, Nuts, Diaper 20 Beer, Coffee, Diaper 30 Beer, Diaper, Eggs 40 Nuts, Eggs, Milk 50 Nuts, Coffee, Diaper, Eggs, Milk Customer buys both Customer buys diaper itemset: A set of items k-itemset X = {x 1,, x k } (absolute) support, or, support cout of X: Frequecy or occurrece of a itemset X (relative) support, s, is the fractio of trasactios that cotais X (i.e., the probability that a trasactio cotais X) A itemset X is frequet if X s support is o less tha a misup threshold 4

Closed Patters ad Max-Patters A log patter cotais a combiatorial umber of sub-patters, e.g., {a 1,, a 100 } cotais 2 100 1 = 1.27*10 30 sub-patters! Solutio: Mie closed patters ad max-patters istead A itemset X is a closed patter if X is frequet ad there exist o super-patters with the same support all super-patters must have smaller support A itemset X is a max-patter if X is frequet ad there exist o super-patters that are frequet Relatioship betwee the two? Closed patters are a lossless compressio of freq. patters, whereas max-patters are a lossy compressio Lossless: ca derive all frequet patters as well as their support Lossy: ca derive all frequet patters 5

Closed Patters ad Max-Patters DB = {<a 1,, a 100 >, < a 1,, a 50 >} mi_sup = 1 What is the set of closed patters? <a 1,, a 100 >: 1 < a 1,, a 50 >: 2 How to derive frequet patters ad their support values? What is the set of max-patters? <a 1,, a 100 >: 1 How to derive frequet patters? What is the set of all patters? {a 1 }: 2,, {a 1, a 2 }: 2,, {a 1, a 51 }: 1,, {a 1, a 2,, a 100 }: 1 A big umber: 2 100 1 6

Closed Patters ad Max-Patters For a give dataset with itemset I = {a,b,c,d} ad mi_sup = 8, the closed patters are {a,b,c,d} with support of 10, {a,b,c} with support of 12, ad {a, b,d} with support of 14. Derive the frequet 2- itemsets together with their support values {a,b}: 14 {a,c}: 12 {a,d}: 14 {b,c}: 12 {b,d}: 14 {c,d}: 10 7

Chapter 6: Miig Frequet Patters, Associatio ad Correlatios Basic cocepts Frequet itemset miig methods Costrait-based frequet patter miig (ch7) Associatio rules 8

Scalable Frequet Itemset Miig Methods Apriori: A Cadidate Geeratio-ad-Test Approach Improvig the Efficiecy of Apriori FPGrowth: A Frequet Patter-Growth Approach ECLAT: Frequet Patter Miig with Vertical Data Format 9

Scalable Methods for Miig Frequet Patters The dowward closure (ati-mootoic) property of frequet patters Ay subset of a frequet itemset must be frequet If {beer, diaper, uts} is frequet, so is {beer, diaper} i.e., every trasactio havig {beer, diaper, uts} also cotais {beer, diaper} Scalable miig methods: Three major approaches Apriori (Agrawal & Srikat@VLDB 94) Freq. patter growth (Fpgrowth: Ha, Pei & Yi @SIGMOD 00) Vertical data format (Charm Zaki & Hsiao @SDM 02) 10

Apriori: A Cadidate Geeratio-ad-Test Approach Apriori pruig priciple: If there is ay itemset that is ifrequet, its superset should ot be geerated/tested! (Agrawal & Srikat @VLDB 94, Maila, et al. @ KDD 94) Method: Iitially, sca DB oce to get frequet 1-itemset Geerate legth (k+1) cadidate itemsets from legth k frequet itemsets Test the cadidates agaist DB Termiate whe o frequet or cadidate set ca be geerated 11

The Apriori Algorithm A Example Tid DB Items 10 a, c, d 20 b, c, e 30 a, b, c, e 40 b, e 1 st sca Itemset sup {a} 2 L C 1 1 {b} 3 {c} 3 {d} 1 {e} 3 C 2 C 2 {a, b} 1 L 2 Itemset sup 2 d sca {a, c} 2 {b, c} 2 {b, e} 3 {c, e} 2 Itemset sup {a, c} 2 {a, e} 1 {b, c} 2 {b, e} 3 {c, e} 2 Itemset sup {a} 2 {b} 3 {c} 3 {e} 3 mi_sup= 2 Itemset {a, b} {a, c} {a, e} {b, c} {b, e} {c, e} C Itemset 3 3 rd sca L 3 {b, c, e} Itemset sup {b, c, e} 2 12

The Apriori Algorithm (Pseudo-code) C k : Cadidate itemset of size k L k : frequet itemset of size k L 1 = {frequet items}; for (k = 1; L k!= ; k++) do begi C k+1 = cadidates geerated from L k ; for each trasactio t i database do icremet the cout of all cadidates i C k+1 that are cotaied i t L k+1 = cadidates i C k+1 with mi_support ed retur k L k ; 13

Implemetatio of Apriori Geerate cadidates, the cout support for the geerated cadidates How to geerate cadidates? Step 1: self-joiig L k Step 2: pruig Example: L 3 ={abc, abd, acd, ace, bcd} Self-joiig: L 3 *L 3 abcd from abc ad abd acde from acd ad ace Pruig: acde is removed because ade is ot i L 3 C 4 ={abcd} The above procedures do ot miss ay legitimate cadidates. Thus Apriori mies a complete set of frequet patters. 14

How to Cout Supports of Cadidates? Why coutig supports of cadidates a problem? The total umber of cadidates ca be very huge Oe trasactio may cotai may cadidates Method: Cadidate itemsets are stored i a hash-tree Leaf ode of hash-tree cotais a list of itemsets ad couts Iterior ode cotais a hash table Subset fuctio: fids all the cadidates cotaied i a trasactio 15

Example: Coutig Supports of Cadidates Subset fuctio 3,6,9 1,4,7 2,5,8 Trasactio: 1 2 3 5 6 1 + 2 3 5 6 1 3 + 5 6 1 2 + 3 5 6 1 4 5 1 2 4 4 5 7 1 2 5 4 5 8 2 3 4 5 6 7 1 3 6 1 5 9 3 4 5 3 5 6 3 5 7 6 8 9 3 6 7 3 6 8 16

Further Improvemet of the Apriori Method Major computatioal challeges Multiple scas of trasactio database Huge umber of cadidates Tedious workload of support coutig for cadidates Improvig Apriori: geeral ideas Reduce passes of trasactio database scas Shrik umber of cadidates Facilitate support coutig of cadidates 17

Apriori applicatios beyod freq. patter miig Give a set S of studets, we wat to fid each subset of S such that the age rage of the subset is less tha 5. Apriori algorithm, level-wise search usig the dowward closure property for pruig to gai efficiecy Ca be used to search for ay subsets with the dowward closure property (i.e., ati-mootoe costrait) CLIQUE for subspace clusterig used the same Apriori priciple, where the oe-dimesioal cells are the items 18

Chapter 6: Miig Frequet Patters, Associatio ad Correlatios Basic cocepts Frequet itemset miig methods Costrait-based frequet patter miig (ch7) Associatio rules 19

Costrait-based (Query-Directed) Miig Fidig all the patters i a database autoomously? urealistic! The patters could be too may but ot focused! Data miig should be a iteractive process User directs what to be mied usig a data miig query laguage (or a graphical user iterface) Costrait-based miig User flexibility: provides costraits o what to be mied Optimizatio: explores such costraits for efficiet miig costrait-based miig: costrait-pushig, similar to push selectio first i DB query processig Note: still fid all the aswers satisfyig costraits, ot fidig some aswers i heuristic search 20

Costraied Miig vs. Costrait-Based Search Costraied miig vs. costrait-based search/reasoig Both are aimed at reducig search space Fidig all patters satisfyig costraits vs. fidig some (or oe) aswer i costrait-based search i AI Costrait-pushig vs. heuristic search It is a iterestig research problem o how to itegrate them Costraied miig vs. query processig i DBMS Database query processig requires to fid all Costraied patter miig shares a similar philosophy as pushig selectios deeply i query processig

Costrait-Based Frequet Patter Miig Patter space pruig costraits Ati-mootoic: If costrait c is violated, its further miig ca be termiated Mootoic: If c is satisfied, o eed to check c agai Succict: c must be satisfied, so oe ca start with the data sets satisfyig c Covertible: c is ot mootoic or ati-mootoic, but it ca be coverted ito it if items i the trasactio ca be properly ordered Data space pruig costrait Data succict: Data space ca be prued at the iitial patter miig process Data ati-mootoic: If a trasactio t does ot satisfy c, t ca be prued from its further miig 22

Ati-Mootoicity i Costrait Pushig Ati-mootoicity Whe a itemset S violates the costrait, so does ay of its superset sum(s.price) v is ati-mootoic sum(s.price) v is ot ati-mootoic C: rage(s.profit) 15 is ati-mootoic Itemset ab violates C So does every superset of ab support cout >= mi_sup is atimootoic core property used i Apriori TDB (mi_sup=2) TID Trasactio 10 a, b, c, d, f 20 b, c, d, f, g, h 30 a, c, d, e, f 40 c, e, f, g Item Profit a 40 b 0 c -20 d 10 e -30 f 30 g 20 h -10

Mootoicity for Costrait Pushig Mootoicity Whe a itemset S satisfies the costrait, so does ay of its superset sum(s.price) v is mootoic mi(s.price) v is mootoic C: rage(s.profit) 15 Itemset ab satisfies C So does every superset of ab Item Profit a 40 b 0 c -20 d 10 e -30 f 30 g 20 h -10 24

Succictess Give A 1, the set of items satisfyig a succictess costrait C, the ay set S satisfyig C is based o A 1, i.e., S cotais a subset belogig to A 1 Idea: Without lookig at the trasactio database, whether a itemset S satisfies costrait C ca be determied based o the selectio of items If a costrait is succict, we ca directly geerate precisely the sets that satisfy it, eve before support coutig begis. Avoids substatial overhead of geerate-ad-test, i.e., such costrait is pre-coutig pushable mi(s.price) v is succict sum(s.price) v is ot succict

Costrait-Based Miig A Geeral Picture Costrait Atimootoe Mootoe Succict v S o yes yes S V o yes yes S V yes o yes mi(s) v o yes yes mi(s) v yes o yes max(s) v yes o yes max(s) v o yes yes cout(s) v yes o weakly cout(s) v o yes weakly sum(s) v ( a S, a 0 ) yes o o sum(s) v ( a S, a 0 ) o yes o rage(s) v yes o o rage(s) v o yes o avg(s) θ v, θ { =,, } covertible covertible o support(s) ξ yes o o support(s) ξ o yes o 26

Chapter 6: Miig Frequet Patters, Associatio ad Correlatios Basic cocepts Frequet itemset miig methods Costrait-based frequet patter miig (ch7) Associatio rules 27

Basic Cocepts: Associatio Rules A associatio rule is of the form X à Y, where X,Y I, X Y = φ A rule is strog if it satisfies both support ad cofidece thresholds. support(x->y): probability that a trasactio cotais X Y, i.e., support(x->y) = P(X U Y) Ca be estimated by the percetage of trasactios i DB that cotai X Y. Not to be cofused with P(X or Y) cofidece(x->y): coditioal probability that a trasactio havig X also cotais Y, i.e. cofidece(x->y) = P(Y X) cofidece(x->y) = P(Y X) = support(x Y) / support (X) = support_cout(x Y) / support_cout(x) cofidece(x->y) ca be easily derived from the support cout of X ad the support cout of X Y. Thus associatio rule miig ca be reduced to frequet patter miig 28

Basic Cocepts: Associatio rules Tid Items bought 10 Beer, Nuts, Diaper 20 Beer, Coffee, Diaper Let misup = 50%, micof = 50% Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3 30 Beer, Diaper, Eggs 40 Nuts, Eggs, Milk 50 Nuts, Coffee, Diaper, Eggs, Milk Associatio rules: (may more!) Beer à Diaper (60%, 100%) Diaper à Beer (60%, 75%) Customer buys both Customer buys diaper If {a} => {b} is a associatio rule, the {b} => {a} is also a associatio rule? q Same support, differet cofidece Customer buys beer If {a,b} => {c} is a associatio rule, the {b} => {c} is also a associatio rule? If {b} => {c} is a associatio rule the {a,b} => {c} is also a associatio rule? 29

Iterestigess Measure: Correlatios (Lift) play basketball eat cereal [40%, 66.7%] is misleadig The overall % of studets eatig cereal is 75% > 66.7%. play basketball ot eat cereal [20%, 33.3%] is more accurate, although with lower support ad cofidece Support ad cofidece are ot good to idicate correlatios Measure of depedet/correlated evets: lift P( A B) lift = P( A) P( B) 2000 / 5000 lift( B, C) = = 0.89 3000 / 5000*3750 / 5000 Basketball Not basketball Sum (row) Cereal 2000 1750 3750 Not cereal 1000 250 1250 Sum(col.) 3000 2000 5000 1000 / 5000 lift( B, C) = = 1.33 3000 / 5000*1250 / 5000 30