The Market-Basket Model. Association Rules. Example. Support. Applications --- (1) Applications --- (2)

Similar documents
Introduction to Data Mining

DATA MINING LECTURE 3. Frequent Itemsets Association Rules

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit:

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #11: Frequent Itemsets

DATA MINING LECTURE 4. Frequent Itemsets and Association Rules

Outline. Fast Algorithms for Mining Association Rules. Applications of Data Mining. Data Mining. Association Rule. Discussion

CS 484 Data Mining. Association Rule Mining 2

CS5112: Algorithms and Data Structures for Applications

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York

10/19/2017 MIST.6060 Business Intelligence and Data Mining 1. Association Rules

Lecture Notes for Chapter 6. Introduction to Data Mining

Association Rules Information Retrieval and Data Mining. Prof. Matteo Matteucci

DATA MINING LECTURE 4. Frequent Itemsets, Association Rules Evaluation Alternative Algorithms

Data Mining Concepts & Techniques

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6

Association Rule. Lecturer: Dr. Bo Yuan. LOGO

D B M G Data Base and Data Mining Group of Politecnico di Torino

Association Rules. Fundamentals

D B M G. Association Rules. Fundamentals. Fundamentals. Elena Baralis, Silvia Chiusano. Politecnico di Torino 1. Definitions.

D B M G. Association Rules. Fundamentals. Fundamentals. Association rules. Association rule mining. Definitions. Rule quality metrics: example

Mining Data Streams. The Stream Model. The Stream Model Sliding Windows Counting 1 s

DATA MINING - 1DL360

Association Rule Mining

12 Count-Min Sketch and Apriori Algorithm (and Bloom Filters)

Data Analytics Beyond OLAP. Prof. Yanlei Diao

DATA MINING - 1DL360

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

Association Rule Mining on Web

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Association Analysis. Part 1

COMP 5331: Knowledge Discovery and Data Mining

CSE 5243 INTRO. TO DATA MINING

Chapters 6 & 7, Frequent Pattern Mining

Machine Learning: Pattern Mining

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar

Frequent Itemset Mining

Chapter 6. Frequent Pattern Mining: Concepts and Apriori. Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining

Mining Infrequent Patterns

On the Complexity of Mining Itemsets from the Crowd Using Taxonomies

Handling a Concept Hierarchy

ECLT 5810 Data Preprocessing. Prof. Wai Lam

Approximate counting: count-min data structure. Problem definition

ASSOCIATION ANALYSIS FREQUENT ITEMSETS MINING. Alexandre Termier, LIG

Frequent Itemset Mining

Mining Data Streams. The Stream Model Sliding Windows Counting 1 s

Data mining, 4 cu Lecture 5:

732A61/TDDD41 Data Mining - Clustering and Association Analysis

Foundations of Math. Chapter 3 Packet. Table of Contents

Association Rules. Acknowledgements. Some parts of these slides are modified from. n C. Clifton & W. Aref, Purdue University

Assignment 7 (Sol.) Introduction to Data Analytics Prof. Nandan Sudarsanam & Prof. B. Ravindran

Unit II Association Rules

Algorithmic Methods of Data Mining, Fall 2005, Course overview 1. Course overview

CS-C Data Science Chapter 8: Discrete methods for analyzing large binary datasets

1. Data summary and visualization

Lecture Notes for Chapter 6. Introduction to Data Mining. (modified by Predrag Radivojac, 2017)

Solving real-world problems using systems of equations

15 Introduction to Data Mining

Association Analysis: Basic Concepts. and Algorithms. Lecture Notes for Chapter 6. Introduction to Data Mining

OPPA European Social Fund Prague & EU: We invest in your future.

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

CSE 5243 INTRO. TO DATA MINING

Mining Positive and Negative Fuzzy Association Rules

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #15: Mining Streams 2

CSE-4412(M) Midterm. There are five major questions, each worth 10 points, for a total of 50 points. Points for each sub-question are as indicated.

Mining Molecular Fragments: Finding Relevant Substructures of Molecules

Lecture 5: Clustering, Linear Regression

Data Mining and Analysis: Fundamental Concepts and Algorithms

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman

CS246 Final Exam. March 16, :30AM - 11:30AM

Data Warehousing & Data Mining

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

Lecture 5: Clustering, Linear Regression

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

Course overview. Heikki Mannila Laboratory for Computer and Information Science Helsinki University of Technology

Algorithmic methods of data mining, Autumn 2007, Course overview0-0. Course overview

CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14. For random numbers X which only take on nonnegative integer values, E(X) =

Data preprocessing. DataBase and Data Mining Group 1. Data set types. Tabular Data. Document Data. Transaction Data. Ordered Data

b. In the situation described above, what is the value of y?

CS246 Final Exam, Winter 2011

DATA MINING - 1DL105, 1DL111

The probability of going from one state to another state on the next trial depends only on the present experiment and not on past history.

Lecture 5: Clustering, Linear Regression

FUZZY ASSOCIATION RULES: A TWO-SIDED APPROACH

COMP 5331: Knowledge Discovery and Data Mining

Data Warehousing & Data Mining

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S

TWO ROADS DIVERGED: TRADING DIVERGENCES (TRADING WITH DR ELDER BOOK 2) BY DR ALEXANDER ELDER

EFFICIENT MINING OF WEIGHTED QUANTITATIVE ASSOCIATION RULES AND CHARACTERIZATION OF FREQUENT ITEMSETS

1 [15 points] Frequent Itemsets Generation With Map-Reduce

CS 584 Data Mining. Association Rule Mining 2

Pattern Structures 1

3. Problem definition

Interesting Patterns. Jilles Vreeken. 15 May 2015

Data mining, 4 cu Lecture 7:

Mining Rank Data. Sascha Henzgen and Eyke Hüllermeier. Department of Computer Science University of Paderborn, Germany

Transcription:

The Market-Basket Model
Association Rules. Market Baskets. Frequent Itemsets. A-Priori Algorithm.
A large set of items, e.g., things sold in a supermarket. A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day.

Support
Simplest question: find sets of items that appear frequently in the baskets. Support for itemset I = the number of baskets containing all items in I. Given a support threshold s, sets of items that appear in at least s baskets are called frequent itemsets.

Example
Items = {milk, coke, pepsi, beer, juice}; support threshold s = 3 baskets.
B1 = {m, c, b}    B2 = {m, p, j}
B3 = {m, b}       B4 = {c, j}
B5 = {m, p, b}    B6 = {m, c, b, j}
B7 = {c, b, j}    B8 = {b, c}
Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}.

Applications --- (1)
Real market baskets: chain stores keep terabytes of information about what customers buy together. Tells how typical customers navigate stores; lets them position tempting items. Suggests tie-in "tricks," e.g., run a sale on diapers and raise the price of beer. High support needed, or no $$'s.

Applications --- (2)
Baskets = documents; items = words in those documents. Lets us find words that appear together unusually frequently, i.e., linked concepts. Baskets = sentences; items = documents containing those sentences. Items that appear together too often could represent plagiarism.
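The support computation on the example baskets can be sketched in a few lines of Python (a minimal illustration, not part of the original slides; item names m, c, p, b, j and threshold s = 3 are as above):

```python
# Support counting over the slide's example baskets.
from itertools import combinations

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support(itemset, baskets):
    """Number of baskets containing every item of `itemset`."""
    return sum(1 for b in baskets if set(itemset) <= b)

s = 3
items = sorted(set().union(*baskets))
frequent_singletons = [i for i in items if support({i}, baskets) >= s]
frequent_pairs = [p for p in combinations(items, 2) if support(p, baskets) >= s]
print(frequent_singletons, frequent_pairs)
```

Running this reproduces the slide's answer: the frequent singletons are {b}, {c}, {j}, {m} and the frequent pairs are {b, c}, {b, m}, {c, j}.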

Applications --- (3)
Baskets = Web pages; items = linked pages. Pairs of pages with many common references may be about the same topic. Baskets = Web pages p; items = pages that link to p. Pages with many of the same links may be mirrors or about the same topic.

Important Point
"Market baskets" is an abstraction that models any many-many relationship between two concepts: "items" and "baskets." Items need not be "contained" in baskets. The only difference is that we count co-occurrences of items related to a basket, not vice versa.

Scale of Problem
WalMart sells 100,000 items and can store billions of baskets. The Web has over 100,000,000 words and billions of pages.

Association Rules
If-then rules about the contents of baskets. {i1, i2, ..., ik} → j means: if a basket contains all of i1, ..., ik, then it is likely to contain j. The confidence of this association rule is the probability of j given i1, ..., ik.

Example
An association rule: {m, b} → c. The baskets containing both m and b are marked: + if the basket also contains c, - if it does not.
+ B1 = {m, c, b}    B2 = {m, p, j}
- B3 = {m, b}       B4 = {c, j}
- B5 = {m, p, b}  + B6 = {m, c, b, j}
  B7 = {c, b, j}    B8 = {b, c}
Confidence = 2/4 = 50%.

Interest
The interest of an association rule X → Y is the absolute value of the amount by which the confidence differs from the probability of Y.
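The confidence computation follows directly from the definition: conf(X → j) = support(X ∪ {j}) / support(X). A minimal sketch on the same eight baskets (the helper names are ours, not the slides'):

```python
# Confidence of the rule {m, b} -> c from the definition
# conf(X -> j) = support(X ∪ {j}) / support(X).
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support(itemset, baskets):
    return sum(1 for b in baskets if itemset <= b)

def confidence(X, j, baskets):
    return support(X | {j}, baskets) / support(X, baskets)

conf = confidence({"m", "b"}, "c", baskets)
print(conf)  # {m, b, c} appears in 2 baskets, {m, b} in 4, so 2/4
```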

Example
B1 = {m, c, b}    B2 = {m, p, j}
B3 = {m, b}       B4 = {c, j}
B5 = {m, p, b}    B6 = {m, c, b, j}
B7 = {c, b, j}    B8 = {b, c}
For association rule {m, b} → c, item c appears in 5/8 of the baskets. Interest = |2/4 - 5/8| = 1/8 --- not very interesting.

Relationships Among Measures
Rules with high support and confidence may be useful even if they are not "interesting." We don't care if buying bread causes people to buy milk, or whether simply a lot of people buy both bread and milk. But high interest suggests a cause that might be worth investigating.

Finding Association Rules
A typical question: find all association rules with support at least s and confidence at least c. Note: the support of an association rule is the support of the set of items it mentions. Hard part: finding the high-support (frequent) itemsets. Checking the confidence of association rules involving those sets is relatively easy.

Computation Model
Typically, data is kept in a flat file rather than a database system. Stored on disk, basket-by-basket. Expand baskets into pairs, triples, etc. as you read baskets.

File Organization
The file is a sequence of baskets: Basket 1, Basket 2, Basket 3, etc.

Computation Model --- (2)
The true cost of mining disk-resident data is usually the number of disk I/O's. In practice, association-rule algorithms read the data in passes --- all baskets read in turn. Thus, we measure the cost by the number of passes an algorithm takes.
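The interest computation above can be checked numerically (a small sketch in the spirit of the slides; the helper functions are ours):

```python
# Interest of {m, b} -> c: |confidence - P(c)| on the example baskets.
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support(itemset, baskets):
    return sum(1 for b in baskets if itemset <= b)

def confidence(X, j, baskets):
    return support(X | {j}, baskets) / support(X, baskets)

p_c = support({"c"}, baskets) / len(baskets)            # 5/8
interest = abs(confidence({"m", "b"}, "c", baskets) - p_c)
print(interest)  # |1/2 - 5/8| = 1/8
```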

Main-Memory Bottleneck
For many frequent-itemset algorithms, main memory is the critical resource. As we read baskets, we need to count something, e.g., occurrences of pairs. The number of different things we can count is limited by main memory. Swapping counts in/out is a disaster.

Finding Frequent Pairs
The hardest problem often turns out to be finding the frequent pairs. We'll concentrate on how to do that, then discuss extensions to finding frequent triples, etc.

Naïve Algorithm
Read the file once, counting in main memory the occurrences of each pair. Expand each basket of n items into its n(n-1)/2 pairs. Fails if (#items)² exceeds main memory. Remember: #items can be 100K (WalMart) or 10B (Web pages).

Details of Main-Memory Counting
Two approaches:
1. Count all item pairs, using a triangular matrix.
2. Keep a table of triples [i, j, c] = "the count of the pair of items {i, j} is c."
(1) requires only (say) 4 bytes per pair; (2) requires 12 bytes, but only for those pairs with count > 0.

Details of Approach #1
Number items 1, 2, ..., n. Keep pairs in the order {1,2}, {1,3}, ..., {1,n}, {2,3}, {2,4}, ..., {2,n}, {3,4}, ..., {3,n}, ..., {n-1,n}. Find pair {i, j}, with i < j, at position (i - 1)(n - i/2) + j - i. Total number of pairs n(n-1)/2; total bytes about 2n².
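The triangular-matrix position formula can be sketched and sanity-checked in a few lines (an illustration we added; the function name is ours). Written with integer arithmetic, (i - 1)(n - i/2) + j - i becomes (i - 1)(2n - i)/2 + j - i:

```python
# Triangular-matrix indexing: pair {i, j} with 1 <= i < j <= n is stored
# at 1-based position (i-1)(n - i/2) + j - i, matching the pair ordering
# {1,2}, {1,3}, ..., {1,n}, {2,3}, ..., {n-1,n} from the slide.
def pair_index(i, j, n):
    assert 1 <= i < j <= n
    # (i-1)*(2n - i) is always even, so integer division is exact.
    return (i - 1) * (2 * n - i) // 2 + (j - i)

# Enumerating all pairs in the slide's order hits positions 1, 2, ..., n(n-1)/2.
n = 5
positions = [pair_index(i, j, n) for i in range(1, n) for j in range(i + 1, n + 1)]
print(positions)
```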

Details of Approach #2
You need a hash table, with i and j as the key, to locate (i, j, c) triples efficiently. Typically, the cost of the hash structure can be neglected. Total bytes used is about 12p, where p is the number of pairs that actually occur. Beats the triangular matrix if at most 1/3 of the possible pairs actually occur.

A-Priori Algorithm --- (1)
A two-pass approach called a-priori limits the need for main memory. Key idea: monotonicity: if a set of items appears at least s times, so does every subset. Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.

A-Priori Algorithm --- (2)
Pass 1: Read baskets and count in main memory the occurrences of each item. Requires only memory proportional to #items.
Pass 2: Read baskets again and count in main memory only those pairs both of whose items were found in Pass 1 to be frequent. Requires memory proportional to the square of the number of frequent items only.

Picture of A-Priori
Pass 1 holds the item counts; Pass 2 holds the table of frequent items and the counts of candidate pairs.

Detail for A-Priori
You can use the triangular-matrix method with n = number of frequent items. Saves space compared with storing triples. Trick: number the frequent items 1, 2, ..., and keep a table relating the new numbers to the original item numbers.

Frequent Triples, Etc.
For each k, we construct two sets of k-tuples:
Ck = candidate k-tuples = those that might be frequent sets (support at least s), based on information from the pass for k-1.
Lk = the set of truly frequent k-tuples.
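The two passes above can be sketched directly (a minimal in-memory illustration of the a-priori idea for pairs; in the slides' setting each Counter would be a disk-backed pass, and the function name is ours):

```python
# Two-pass a-priori for frequent pairs:
#   Pass 1 counts items; Pass 2 counts only pairs of frequent items.
from itertools import combinations
from collections import Counter

def apriori_pairs(baskets, s):
    # Pass 1: count individual items, keep those with count >= s.
    item_counts = Counter(item for b in baskets for item in b)
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    # Pass 2: count only candidate pairs, both of whose items are frequent.
    pair_counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(frequent_items & b), 2):
            pair_counts[pair] += 1
    return {p for p, c in pair_counts.items() if c >= s}

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
print(apriori_pairs(baskets, 3))
```

On the running example with s = 3 this yields the same frequent pairs as before: {b, c}, {b, m}, {c, j}.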

A-Priori for All Frequent Itemsets
C1 --filter--> L1 --construct--> C2 --filter--> L2 --construct--> C3 --> ... (the first filter step is the first pass over the data; the second filter step is the second pass). One pass for each k. Needs room in main memory to count each candidate k-tuple. For typical market-basket data and reasonable support (e.g., 1%), k = 2 requires the most memory.

Frequent Itemsets --- (2)
C1 = all items. L1 = those items counted on the first pass to be frequent. C2 = pairs, both of whose items are in L1. In general, Ck = k-tuples, each (k-1)-subset of which is in Lk-1. Lk = members of Ck with support at least s.