Sampling for Frequent Itemset Mining: The Power of PAC Learning

prof. dr Arno Siebes, Algorithmic Data Analysis Group, Department of Information and Computing Sciences, Universiteit Utrecht

The Papers Today we discuss Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees (Proc. ECML PKDD 2012, LNCS 7523; journal version in ACM Transactions on Knowledge Discovery from Data, Vol. 8, No. 4, 2014). Mining Frequent Itemsets through Progressive Sampling with Rademacher Averages, which was announced to be part of the course, is unfortunately skipped, because it requires yet more concepts and techniques and you do need time for your experiments and for your essay. Note that we restrict ourselves to frequent itemsets only.

The Problem Let $D$ be a transaction database over $\mathcal{I}$ (which we assume fixed throughout the lecture). Denote by $F(D, \theta)$ the set of frequent itemsets and their support, i.e., $(I, \mathrm{supp}_D(I)) \in F(D, \theta) \Leftrightarrow \mathrm{supp}_D(I) \geq \theta$. We want to compute a (small) sample $S \subseteq D$ such that $F(S, \theta') \approx F(D, \theta)$. More formally, we want our sample to yield an $(\epsilon, \delta)$-approximation: for $(\epsilon, \delta) \in (0, 1)^2$, an $(\epsilon, \delta)$-approximation of $F(D, \theta)$ is a set $C = \{(I, s(I)) \mid I \subseteq \mathcal{I}, s(I) \in (0, 1]\}$ such that with probability at least $1 - \delta$:
1. $(I, \mathrm{supp}_D(I)) \in F(D, \theta) \Rightarrow (I, s(I)) \in C$
2. $(I, s(I)) \in C \Rightarrow \mathrm{supp}_D(I) \geq \theta - \epsilon$
3. $(I, s(I)) \in C \Rightarrow |\mathrm{supp}_D(I) - s(I)| \leq \epsilon/2$
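
As an aside (not part of the original slides), the three conditions are easy to state in code. The minimal Python sketch below checks them for one concrete candidate set C against the exact supports in D; the "probability at least 1 - delta" refers to the random sample that produced C, so for a fixed C the check itself is deterministic. Function names such as is_eps_approximation are my own, and the brute-force enumeration of itemsets is only feasible for tiny item alphabets.

from itertools import chain, combinations

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def all_itemsets(items):
    """All non-empty itemsets over the item alphabet."""
    items = sorted(items)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

def is_eps_approximation(C, D, theta, eps):
    """C maps itemsets to estimated supports s(I); D is a list of frozensets."""
    items = set(chain.from_iterable(D))
    for I in all_itemsets(items):
        # condition 1: every truly frequent itemset must appear in C
        if support(I, D) >= theta and I not in C:
            return False
    for I, s in C.items():
        # condition 2: itemsets in C are at least "almost frequent"
        if support(I, D) < theta - eps:
            return False
        # condition 3: the estimated support is within eps/2 of the truth
        if abs(support(I, D) - s) > eps / 2:
            return False
    return True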

Now in Natural Language $C$ is an $(\epsilon, \delta)$-approximation of $F(D, \theta)$ if with probability at least $1 - \delta$: $C$ contains all frequent itemsets; the non-frequent itemsets in $C$ are almost frequent; our estimate of the frequency of the itemsets in $C$ is approximately correct. This means that (with high probability) $C$ may contain false positives but no false negatives, and there is limited loss of accuracy. Which means that we can compute $F(D, \theta)$ with one scan over $D$ using $C$, an advantage over Toivonen's scheme.

In Terms of Samples In the terminology of samples we can now rephrase our goal as: find a sample $S$ such that $P(\exists I \subseteq \mathcal{I} : |\mathrm{supp}_D(I) - \mathrm{supp}_S(I)| > \epsilon/2) < \delta$. Because if this holds for $S$, then $F(S, \theta - \epsilon/2)$ is an $(\epsilon, \delta)$-approximation of $F(D, \theta)$, for the simple reasons that
1. with probability at least $1 - \delta$: $F(D, \theta) \subseteq F(S, \theta - \epsilon/2)$
2. $I \in F(S, \theta - \epsilon/2) \Rightarrow$ [with probability at least $1 - \delta$: $\mathrm{supp}_D(I) \geq \mathrm{supp}_S(I) - \epsilon/2 \geq \theta - \epsilon$]
3. $I \in F(S, \theta - \epsilon/2) \Rightarrow$ [with probability at least $1 - \delta$: $|\mathrm{supp}_D(I) - \mathrm{supp}_S(I)| \leq \epsilon/2$]

How Big Should $S$ be? For each itemset $I$, the Chernoff bound gives $P(|\mathrm{supp}_D(I) - \mathrm{supp}_S(I)| \geq \epsilon/2) \leq 2e^{-|S|\epsilon^2/12}$. Since there are a priori $2^{|\mathcal{I}|}$ potentially frequent itemsets, the union bound gives us $P(\exists I : |\mathrm{supp}_D(I) - \mathrm{supp}_S(I)| \geq \epsilon/2) \leq 2^{|\mathcal{I}|} \cdot 2e^{-|S|\epsilon^2/12}$. So, to make this probability less than $\delta$ we need $|S| \geq \frac{12}{\epsilon^2}\left(|\mathcal{I}| + \ln(2) + \ln\frac{1}{\delta}\right)$. Since $|\mathcal{I}|$ tends to be very large (think of all the goods Amazon sells) this is not a very good bound; we should be able to do better using PAC learning!
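
To get a feel for how pessimistic this union-bound requirement is, the small Python snippet below (mine, not from the papers) simply evaluates the formula; already for a modest alphabet of 1000 items and eps = 0.05, delta = 0.1 it asks for roughly 4.8 million transactions.

from math import log

def naive_sample_size(n_items, eps, delta):
    """Union-bound sample size: (12 / eps^2) * (|I| + ln 2 + ln(1/delta))."""
    return (12 / eps ** 2) * (n_items + log(2) + log(1 / delta))

print(round(naive_sample_size(1000, 0.05, 0.1)))  # roughly 4.8 million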

Classifiers and Hypothesis Sets A classifier is simply a function $f : X \to \{0, 1\}$. A set of hypotheses $H$ is simply a set of classifiers, i.e., a set of such functions. There is a 1-1 relation between classifiers and subsets of $X$: each subset $A$ of $X$ has an indicator function $1_A$, with $1_A(x) = 1$ if $x \in A$ and $1_A(x) = 0$ otherwise, which is a classifier on $X$. In other words, we can see $H$ as a set of subsets of $X$.
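
The correspondence between subsets and classifiers is literally a one-liner; the following minimal illustration (names of my own choosing, not from the lecture) builds the classifier $1_A$ from a subset A.

def indicator(A):
    """The classifier 1_A induced by the subset A of X."""
    A = frozenset(A)
    return lambda x: 1 if x in A else 0

f = indicator({2, 3, 5, 7})
print(f(3), f(4))  # prints: 1 0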

Range Space This is formalised as a range space. A range space $(X, R)$ consists of a finite or infinite set of points $X$ and a finite or infinite family $R$ of subsets of $X$, called ranges. For any $A \subseteq X$, the projection of $R$ on $A$ is given by $\Pi_R(A) = \{r \cap A \mid r \in R\}$. So, in this terminology, we have that a subset $A \subseteq X$ is shattered by $R$ if $\Pi_R(A) = \mathcal{P}(A)$.
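
For concreteness, here is a small Python sketch of the projection and the shattering test for an explicitly given finite range space; the function names are my own, and the powerset enumeration obviously only works for small A.

from itertools import combinations

def powerset(A):
    """All subsets of A, as frozensets."""
    A = list(A)
    return {frozenset(c) for r in range(len(A) + 1) for c in combinations(A, r)}

def projection(R, A):
    """Pi_R(A) = { r intersected with A : r in R }."""
    A = frozenset(A)
    return {frozenset(r) & A for r in R}

def is_shattered(R, A):
    """A is shattered by R iff the projection of R on A is the full powerset of A."""
    return projection(R, A) == powerset(A)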

$\epsilon$-Approximations $\epsilon$-representative samples translate to $\epsilon$-approximations: let $(X, R)$ be a range space and let $A \subseteq X$ be a finite subset. For $\epsilon \in (0, 1)$, a $B \subseteq A$ is an $\epsilon$-approximation for $A$ if $\forall r \in R : \left| \frac{|A \cap r|}{|A|} - \frac{|B \cap r|}{|B|} \right| \leq \epsilon$. Without proof (it should look and feel familiar): there is a constant $c$ ($\approx 0.5$ experimentally) such that if $(X, R)$ is a range space of VC dimension at most $\upsilon$, $A \subseteq X$ a finite subset, and $(\epsilon, \delta) \in (0, 1)^2$, then a random $B \subseteq A$ of cardinality $m \geq \min\left\{|A|, \frac{c}{\epsilon^2}\left(\upsilon + \log\frac{1}{\delta}\right)\right\}$ is an $\epsilon$-approximation for $A$ with probability at least $1 - \delta$.
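
Again as an illustration (not from the slides), the defining condition of an epsilon-approximation can be checked directly when X, R, A and B are small and explicit; the name is_eps_approx is my own.

def is_eps_approx(B, A, R, eps):
    """B (a subset of A) is an eps-approximation for A w.r.t. the ranges R if the
    relative frequency of every range in B is within eps of that in A."""
    A, B = list(A), list(B)
    for r in R:
        r = frozenset(r)
        if abs(sum(x in r for x in A) / len(A) - sum(x in r for x in B) / len(B)) > eps:
            return False
    return True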

A Range Space for Transaction Data To use this bound, we should first formulate our hypothesis set, i.e., our range space. Let $D$ be a transaction database over $\mathcal{I}$. The associated range space $S = (X, R)$ is defined by
1. $X = D$, the set of transactions
2. $R = \{T_D(I) \mid I \subseteq \mathcal{I}, I \neq \emptyset\}$, where $T_D(I) = \{t \in D \mid I \subseteq t\}$
The range of $I$, $T_D(I)$, is the set of transactions that support $I$, i.e., its support set. Note that $R$ contains exactly the support sets of the closed itemsets: the support set of a non-closed itemset is also the support set of a closed itemset. Now all we have to do is to determine the VC dimension of this range space; as so often, we'll settle for a (tight) upper bound.
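
As a tiny illustration of this range space (my own toy code, assuming transactions are represented as frozensets, and using the same toy dataset as the example a few slides further on), the range of an itemset is just its support set, and its relative size is the support.

def T(D, I):
    """The range of itemset I: all transactions in D that contain I (its support set)."""
    I = frozenset(I)
    return [t for t in D if I <= t]

D = [frozenset(t) for t in ({'a', 'b', 'c', 'd'}, {'a', 'b'}, {'a', 'c'}, {'d'})]
print(T(D, {'a'}))                # the three transactions containing 'a'
print(len(T(D, {'a'})) / len(D))  # supp_D({a}) = 0.75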

A First Result Let $D$ be a dataset with associated range space $S = (X, R)$. Then $VC(S) \geq \upsilon \in \mathbb{N}$ if and only if there exists an $A \subseteq X$ with $|A| = \upsilon$ such that $\forall B \subseteq A \; \exists I_B \subseteq \mathcal{I} : T_A(I_B) = B$.
$\Leftarrow$: For all $B \subseteq A$, $T_D(I_B) \cap A = B$, which means that $B \in \Pi_R(A)$ for all $B$. That means that $\Pi_R(A) = \mathcal{P}(A)$: $A$ is shattered and, thus, $VC(S) \geq \upsilon$.
$\Rightarrow$: If $VC(S) \geq \upsilon$ then there is an $A \subseteq D$ with $|A| = \upsilon$ such that $\Pi_R(A) = \mathcal{P}(A)$. Hence, for every $B \subseteq A$, there is an itemset $I_B$ such that $T_D(I_B) \cap A = B$. Since $T_A(I_B) \subseteq T_D(I_B)$, this means that $T_A(I_B) \cap A \subseteq B$. Now note that $T_A(I_B) \subseteq A$ and by construction $B \subseteq T_A(I_B)$, and thus $T_A(I_B) = B$.

An Immediate Consequence Let $D$ be a dataset with associated range space $S = (X, R)$. Then $VC(S)$ is the largest $\upsilon \in \mathbb{N}$ such that there is an $A \subseteq D$ with $|A| = \upsilon$ such that $\forall B \subseteq A \; \exists I_B \subseteq \mathcal{I} : T_A(I_B) = B$.
Example: $D = \{\{a, b, c, d\}, \{a, b\}, \{a, c\}, \{d\}\}$ and $A = \{\{a, b\}, \{a, c\}\}$, with witnesses $I_{\{\{a,b\}\}} = \{a, b\}$, $I_{\{\{a,c\}\}} = \{a, c\}$, $I_A = \{a\}$, $I_\emptyset = \{d\}$. Larger subsets of $D$ cannot be shattered and hence its VC dimension is 2.
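
The example can be verified mechanically. The self-contained Python sketch below (illustrative only, reusing the toy dataset from the slide) checks that A = {{a,b}, {a,c}} is shattered by the support-set ranges and that no three transactions of D are, so the VC dimension is indeed 2.

from itertools import chain, combinations

D = [frozenset(t) for t in ({'a', 'b', 'c', 'd'}, {'a', 'b'}, {'a', 'c'}, {'d'})]
items = sorted(set(chain.from_iterable(D)))

def T(A, I):
    """Support set of itemset I restricted to the transactions in A."""
    return frozenset(t for t in A if frozenset(I) <= t)

def itemsets(items):
    """All non-empty itemsets over the item alphabet."""
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

def shattered(A):
    """Every B that is a subset of A must equal T_A(I_B) for some itemset I_B."""
    realised = {T(A, I) for I in itemsets(items)}
    subsets = {frozenset(c) for r in range(len(A) + 1) for c in combinations(A, r)}
    return subsets <= realised

A = [frozenset({'a', 'b'}), frozenset({'a', 'c'})]
print(shattered(A))                                    # True: A is shattered
print(any(shattered(C) for C in combinations(D, 3)))   # False: no 3 transactions shatter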

Nice, But It is good to have a simple characterisation of the VC dimension. But since it puts a requirement on all $B \subseteq A$, it is potentially very costly to compute; in fact, it is known to be $O(|R| \cdot |X|^{\log |R|})$. Fortunately, our corollary (the immediate consequence) suggests an alternative: we need a set of $\upsilon$ transactions, and if all of them are at least $\upsilon$ long we have enough freedom to make the condition hold. For technical reasons we first assume that the transactions are independent in the sense that none is contained in another, i.e., they form an antichain.

The d-index Let $D$ be a dataset. The d-index of $D$ is the largest integer $d$ such that $D$ contains at least $d$ transactions of length at least $d$ that form an antichain.
Theorem: Let $D$ be a dataset with d-index $d$. Then the range space $S = (X, R)$ associated with $D$ has VC dimension of at most $d$.
This upper bound is tight: there are datasets for which the VC dimension equals their d-index.

Proof Sketch Let $\ell > d$ and assume that $T \subseteq D$ with $|T| = \ell$ can be shattered. Note that this means that $T$ is an antichain: if $t_1 \supseteq t_2$, all ranges containing $t_2$ also contain $t_1$, so we cannot shatter. For any $t \in T$, there are $2^{\ell - 1}$ subsets of $T$ that contain $t$, so $t$ must occur in $2^{\ell - 1}$ distinct ranges. But $t$ only occurs in a range $T_D(A)$ if $A \subseteq t$, which means that $t$ occurs in at most $2^{|t|} - 1$ ranges. From the definition of $d$ we know that $T$ must contain a $t$ such that $|t| < \ell$, otherwise the d-index would be at least $\ell$. This means that $2^{|t|} - 1 < 2^{\ell - 1}$, so $t$ cannot appear in $2^{\ell - 1}$ ranges. This is a contradiction. So, the assumed $T$ cannot exist. Hence, the largest set that is shattered has size at most $d$.

From d-index to d-bound The d-index of $D$ is still a bit hard to compute because of the antichain requirement. So, let us forget about that requirement. Let $D$ be a dataset; the d-bound of $D$ is the largest integer $d$ such that $D$ contains at least $d$ transactions of length at least $d$.
Theorem: Let $D$ be a dataset with d-bound $d$. Then the range space $S = (X, R)$ associated with $D$ has VC dimension of at most $d$.
This is obvious as d-bound $\geq$ d-index: a subset witnessing the d-index satisfies the conditions for the d-bound (but not vice versa).

Computing The d-bound Computing the d-bound is easy: do a single scan of the dataset, maintaining the $\ell$ longest (different) transactions that are at least $\ell$ long, breaking ties arbitrarily. See the journal version for the full details and the proof!
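
One possible single-scan implementation (a sketch of my own, not the paper's exact algorithm) keeps a min-heap of the lengths of the longest distinct transactions seen so far; the invariant is that the heap holds k lengths that are all at least k, so its final size is the d-bound. The seen set used for deduplication is a simplification; see the journal version for a more careful treatment.

import heapq

def d_bound(transactions):
    """Largest d such that at least d distinct transactions have length >= d."""
    heap, seen = [], set()
    for t in transactions:
        t = frozenset(t)
        if t in seen:                 # count different transactions only
            continue
        seen.add(t)
        heapq.heappush(heap, len(t))
        if heap[0] < len(heap):       # smallest kept length too short for the new size
            heapq.heappop(heap)
    return len(heap)

D = [frozenset(t) for t in ({'a', 'b', 'c', 'd'}, {'a', 'b'}, {'a', 'c'}, {'d'})]
print(d_bound(D))  # 2 for the toy dataset used earlier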

The Sample Size (finally) Combining all the results we have seen so far, we have: let $D$ be a dataset, let $d$ be the d-bound of $D$, and let $(\epsilon, \delta) \in (0, 1)^2$. Let $S$ be a random sample of $D$ with size $|S| \geq \min\left\{|D|, \frac{4c}{\epsilon^2}\left(d + \log\frac{1}{\delta}\right)\right\}$. Then $F(S, \theta - \epsilon/2)$ is, with probability at least $1 - \delta$, an $\epsilon$-close approximation of $F(D, \theta)$, i.e., an $(\epsilon, \delta)$-approximation. Such a sample we can easily compute from $D$ in a single scan. Hence we need two scans of $D$ to compute an $\epsilon$-close approximation of the set of all frequent itemsets.
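
Putting the pieces together, a minimal end-to-end sketch could look as follows (illustrative Python only; c = 0.5 is the experimentally suggested constant from the epsilon-approximation theorem, d is assumed to come from the d_bound routine sketched above, and the brute-force miner is only meant for tiny examples; a real implementation would run an Apriori or FP-growth style miner on the sample).

import math, random
from itertools import combinations

def sample_size(d, eps, delta, n, c=0.5):
    """min{ |D|, (4c / eps^2) * (d + log(1/delta)) }."""
    return min(n, math.ceil((4 * c / eps ** 2) * (d + math.log(1 / delta))))

def mine_frequent(transactions, theta):
    """Brute-force frequent itemset mining; returns {itemset: support}."""
    items = sorted(set().union(*transactions))
    result = {}
    for r in range(1, len(items) + 1):
        for I in map(frozenset, combinations(items, r)):
            s = sum(I <= t for t in transactions) / len(transactions)
            if s >= theta:
                result[I] = s
    return result

def approximate_frequent_itemsets(D, theta, eps, delta, d):
    """F(S, theta - eps/2) computed on a random sample S of the derived size."""
    m = sample_size(d, eps, delta, len(D))
    S = random.choices(D, k=m)        # sample with replacement
    return mine_frequent(S, theta - eps / 2)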

Your Essay The essay you have to submit consists of two parts:
1. A part explaining the results we achieved today: 5-7 pages
2. A part doing experiments: 3-4 pages
You have to submit by April 20 (April 13 would have been the correct date, but I misread the schedule; use this extra week wisely). You submit by email; the subject of the email contains [ESSAY BIG DATA], your name and your student number. Provide the same information at the start of each of the two parts and start each part on an odd page.

Writing an Essay Most of you did not submit essays before. The most important guideline is: be coherent. Define the concepts (terms) you use and use them coherently; define before you use. On writing: use a spell checker and check whether or not a word or an expression you use means what you think it means. You have to explain non-trivial material, and long sentences are more confusing, so keep your sentences short. Paragraphs have one message only; multiple messages mean multiple paragraphs. Sections are a top-level division of your argument; use this to guide your reader.

Page Limits The page limits are strict; violations will cost you dearly. The reason is twofold: if you cannot explain in the allotted number of pages, you probably do not understand, and an unlimited number of pages would make marking impossible. You are free to choose your favourite text processor: troff, word, latex,... Given that you need graphs, tables and math, latex might be a wise choice. And you are free to choose your own style; the standard latex article style is probably not the best choice, but note that the result should be readable.

Part 1 Very briefly, one could say that all you have to do is: recursively explain today's lecture. In a bit more detail (not necessarily in your order): What is the problem? And why is it a problem? What are frequent itemsets? And why should we care about them? What is Toivonen's idea? And why do we want more? What is the link with classification? And why is it a good idea to follow that route? What is PAC learning? And why is it a good approach? What are the important concepts and results in PAC learning? And how do they relate to our problem? Etcetera, etcetera.

Part 1, Continued Another way to phrase it is that you are expected to write a summary of the course material, a summary targeted at the main problem. Such a summary is not a list of definitions and theorems. Your target audience is a CS master student who did not take this course (yet). You can, and probably should, use math to be precise, but the intuition is at least as important: copying math statements does not convey understanding. In the end, this hypothetical student should understand what the course is all about and realize that he or she should enrol next year. You should conclude your essay with an argument that conceptually compares Toivonen and Riondato: was it worth the trip?

Part 2 You end part 1 with a conceptual comparison; in this part you report on an experimental comparison. For the experiments you can write your own implementations, use existing implementations, whatever you like, but you will always need to do some programming yourself. You can choose your own data sets, but make sure that they are big enough for the sampling you'll have to do; if your data set is too small, you can clearly upsample, but that defeats the purpose and it makes supports higher. You compare on sample sizes, on false positives and false negatives, and on how far off the estimated supports are. So you need the truth: compare smartly!

Part 2, Continued In your report you should describe your experimental set-up in enough detail to be repeatable, and give the results of your experiments in graphs, tables,... And in the end, you should conclude which approach is better according to your experiments, and why. Given that there are many parameters ($\theta$ for all, $\epsilon$ and $\delta$ for Riondato, $\mu$ for Toivonen), one experiment is probably not sufficient for well founded conclusions. Your mark is based on how well founded your conclusions are, not on what your conclusions are. If your conclusions are that it is all total and utter BS, fine!