Exercise 1. min_sup = 0.3. Items support Type a 0.5 C b 0.7 C c 0.5 C d 0.9 C e 0.6 F

Similar documents
CS 484 Data Mining. Association Rule Mining 2

CS 584 Data Mining. Association Rule Mining 2

Association Rules. Fundamentals

D B M G Data Base and Data Mining Group of Politecnico di Torino

D B M G. Association Rules. Fundamentals. Fundamentals. Elena Baralis, Silvia Chiusano. Politecnico di Torino 1. Definitions.

D B M G. Association Rules. Fundamentals. Fundamentals. Association rules. Association rule mining. Definitions. Rule quality metrics: example

Data Mining Concepts & Techniques

Lecture Notes for Chapter 6. Introduction to Data Mining

DATA MINING - 1DL360

DATA MINING - 1DL360

Data Mining and Analysis: Fundamental Concepts and Algorithms

COMP 5331: Knowledge Discovery and Data Mining

DATA MINING LECTURE 4. Frequent Itemsets, Association Rules Evaluation Alternative Algorithms

Association Analysis Part 2. FP Growth (Pei et al 2000)

ASSOCIATION ANALYSIS FREQUENT ITEMSETS MINING. Alexandre Termier, LIG

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6

Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar

Frequent Itemset Mining

DATA MINING LECTURE 3. Frequent Itemsets Association Rules

732A61/TDDD41 Data Mining - Clustering and Association Analysis

Association Analysis: Basic Concepts. and Algorithms. Lecture Notes for Chapter 6. Introduction to Data Mining

Lecture Notes for Chapter 6. Introduction to Data Mining. (modified by Predrag Radivojac, 2017)

Association Rules Information Retrieval and Data Mining. Prof. Matteo Matteucci

Data Mining and Analysis: Fundamental Concepts and Algorithms

DATA MINING LECTURE 4. Frequent Itemsets and Association Rules

Frequent Itemset Mining

Basic Data Structures and Algorithms for Data Profiling Felix Naumann

ST3232: Design and Analysis of Experiments

Descrip9ve data analysis. Example. Example. Example. Example. Data Mining MTAT (6EAP)

Experimental design (DOE) - Design

Frequent Itemset Mining

Design theory for relational databases

Descrip<ve data analysis. Example. E.g. set of items. Example. Example. Data Mining MTAT (6EAP)

Association Rule. Lecturer: Dr. Bo Yuan. LOGO

Descriptive data analysis. E.g. set of items. Example: Items in baskets. Sort or renumber in any way. Example

COM111 Introduction to Computer Engineering (Fall ) NOTES 6 -- page 1 of 12

UNIT 3 BOOLEAN ALGEBRA (CONT D)

TWO-LEVEL FACTORIAL EXPERIMENTS: REGULAR FRACTIONAL FACTORIALS

Chapter 11: Factorial Designs

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

LECTURE 10: LINEAR MODEL SELECTION PT. 1. October 16, 2017 SDS 293: Machine Learning

Association Analysis. Part 2

Frequent Itemset Mining

Geometry Problem Solving Drill 08: Congruent Triangles

Association Rules. Acknowledgements. Some parts of these slides are modified from. n C. Clifton & W. Aref, Purdue University

SOLUTION. Taken together, the preceding equations imply that ABC DEF by the SSS criterion for triangle congruence.

Chapters 6 & 7, Frequent Pattern Mining

CS246 Final Exam. March 16, :30AM - 11:30AM

Strategy of Experimentation III

Descriptive data analysis. E.g. set of items. Example: Items in baskets. Sort or renumber in any way. Example item id s

Institutionen för matematik och matematisk statistik Umeå universitet November 7, Inlämningsuppgift 3. Mariam Shirdel

Course Content. Association Rules Outline. Chapter 6 Objectives. Chapter 6: Mining Association Rules. Dr. Osmar R. Zaïane. University of Alberta 4

Association)Rule Mining. Pekka Malo, Assist. Prof. (statistics) Aalto BIZ / Department of Information and Service Management

MATH 251 MATH 251: Multivariate Calculus MATH 251 FALL 2006 EXAM-II FALL 2006 EXAM-II EXAMINATION COVER PAGE Professor Moseley

CS 5014: Research Methods in Computer Science

The One-Quarter Fraction

Session 3 Fractional Factorial Designs 4

Construction of Mixed-Level Orthogonal Arrays for Testing in Digital Marketing

Temporal Data Mining

CSE 5243 INTRO. TO DATA MINING

Interesting Patterns. Jilles Vreeken. 15 May 2015

STAT451/551 Homework#11 Due: April 22, 2014

Association Analysis 1

FP-growth and PrefixSpan

CHAPTER 3 BOOLEAN ALGEBRA

Data mining, 4 cu Lecture 5:

CHAPTER 5 KARNAUGH MAPS

MATH 251 MATH 251: Multivariate Calculus MATH 251 FALL 2005 EXAM-I FALL 2005 EXAM-I EXAMINATION COVER PAGE Professor Moseley

USE OF COMPUTER EXPERIMENTS TO STUDY THE QUALITATIVE BEHAVIOR OF SOLUTIONS OF SECOND ORDER NEUTRAL DIFFERENTIAL EQUATIONS

CSE 5243 INTRO. TO DATA MINING

Four Paradigms in Data Mining

Triangles. 3.In the following fig. AB = AC and BD = DC, then ADC = (A) 60 (B) 120 (C) 90 (D) none 4.In the Fig. given below, find Z.

Chapter 6. Frequent Pattern Mining: Concepts and Apriori. Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining

Karnaugh Map & Boolean Expression Simplification

GCSE style questions arranged by topic

Positive Borders or Negative Borders: How to Make Lossless Generator Based Representations Concise

Given. Segment Addition. Substitution Property of Equality. Division. Subtraction Property of Equality

Advanced Digital Design with the Verilog HDL, Second Edition Michael D. Ciletti Prentice Hall, Pearson Education, 2011

Lecture 12: Feb 16, 2017

MATH 251 MATH 251: Multivariate Calculus MATH 251 FALL 2005 EXAM-IV FALL 2005 EXAM-IV EXAMINATION COVER PAGE Professor Moseley

Association (Part II)

Mathematics. Exercise 6.4. (Chapter 6) (Triangles) (Class X) Question 1: Let and their areas be, respectively, 64 cm 2 and 121 cm 2.

Machine Learning: Pattern Mining

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05

CS 412 Intro. to Data Mining

TRIANGLES CHAPTER 7. (A) Main Concepts and Results. (B) Multiple Choice Questions

Basic Quadrilateral Proofs

MATH 251 MATH 251: Multivariate Calculus MATH 251 FALL 2005 EXAM-3 FALL 2005 EXAM-III EXAMINATION COVER PAGE Professor Moseley

Combinatorial Analysis

SOLUTIONS SECTION A [1] = 27(27 15)(27 25)(27 14) = 27(12)(2)(13) = cm. = s(s a)(s b)(s c)

CS 5014: Research Methods in Computer Science. Experimental Design. Potential Pitfalls. One-Factor (Again) Clifford A. Shaffer.

CHAPTER 7 THE SAMPLING DISTRIBUTION OF THE MEAN. 7.1 Sampling Error; The need for Sampling Distributions

Maintaining Frequent Itemsets over High-Speed Data Streams

CPT+: A Compact Model for Accurate Sequence Prediction

Fractional Factorials

α-acyclic Joins Jef Wijsen May 4, 2017

Mathematics 2260H Geometry I: Euclidean geometry Trent University, Winter 2012 Quiz Solutions

CHAPTER III BOOLEAN ALGEBRA

Angles of Elevation and Depression

Nozha Directorate of Education Form : 2 nd Prep

Transcription:

Exercise 1 min_sup = 0.3 Items support Type a 0.5 C b 0.7 C c 0.5 C d 0.9 C e 0.6 F Items support Type ab 0.3 M ac 0.2 I ad 0.4 F ae 0.4 F bc 0.3 M bd 0.6 C be 0.4 F cd 0.4 C ce 0.2 I de 0.6 C Items support Type abc I abd 0.2 I abe 0.2 I acd I ace I ade 0.4 M bcd 0.2 I bce I bde 0.4 M cde I Support for abc, acd and ace is not counted since itemset ac is infrequent and therefore any superset cannot be frequent which means that the superset cannot also be closed or maximal. The same counts for ace, bce and cde, because itemset ce was found infrequent. Items support Type abcd I abce I abde 0.2 I acde I bcde I Support for abcd, abce, acde and bcde is not counted since they have subsets which are infrequent.

Itemsets ab,bc, ade and bde are found maximal since they are frequent but none of their supersets are frequent. Maximal itemsets are also closed! Itemsets a,b,c,d,ab,bc, bd, cd,,de, ade, bde are closed since none of their immediate supersets have the same support as they have. Itemset e is frequent but not closed since itemset de has the same support as it. Since the itemset b is not closed it cannot be maximal. The same is true for itemsets ad (s(ad)=s(ade)), ae (s(ae)=s(ade)) and be (s(be)=s(bde)). The rest of itemsets are infrequent since they haven t got enough big support. Figure 1. Green nodes are frequent (also null), white infrequent, blue maximal and boxed ones closed itemsets. Exercise 2 Added nodes are marked as green and updated nodes as orange. a) alphabetical order

Adding {abde} B:1 Adding {bcd} B:1 B:1

Adding {abde} B:1 B:2 Adding {acde} A:3 B:1 B:2

Adding {bcde} A:3 B:2 B:2 C:2 Adding {bde} A:3 B:3 B:2 C:2

Adding {cd} A:3 B:3 B:2 C:2 Adding {abc} A:4 B:3 B:3 C:2

Adding {ade} A:5 B:3 B:3 C:2 Adding {bd} A:5 B:4 B:3 C:2

b) descending order of support order: D:0.9, B:0.7, E:0.6, A:0.5, C:0.5 Adding {dbea} B:1 Adding {dbc} B:2

Adding {dbea} D:3 B:3 Adding {deac} D:4 B:3

Adding {dbec} D:5 B:4 E:3 Adding {dbe} D:6 B:5 E:4

Adding {dc} D:7 B:5 E:4 Adding {bac} D:7 B:1 B:5 E:4

Adding {dea} D:8 B:1 B:5 E:4 Adding {db} D:9 B:1 B:6 E:4

Exercise 3 min_sup = 0.3 a) {d}-conditional tree set of paths ending in d A:4 B:4 B:2 C:2 and {d}-conditional tree A:4 B:4 B:2 C:2 b) {c,d}-conditional tree

set of paths ending in cd B:2 C:2 and {c,d}-conditional tree Exercise 4 a) b c c not c b 3 4 7 not b 2 1 3 5 5 10 The first cell (b and c) has the support of itemset {bc}, cell b not c has count of transactions which include b but not c etc.

b) a d d not d d 4 1 5 not d 5 0 5 9 1 10 c) b d d not d b 6 1 7 not b 3 0 3 9 1 10 d) e c c not c e 2 4 6 not e 3 1 4 5 5 10 e) e a a not a e 4 2 6 not e 1 3 4 5 5 10 Exercise 5 a) confidence: c(a B) = s(ab) / s(a) c(b c) = s(bc)/s(b) = 3/7 0.429 c(a d) = s(ad)/s(a) = 4/5 = 0.8 c(b d) = s(bd)/s(b) = 6/7 0.857 c(e c) = s(ec)/s(e) = 2/6 0.333 c(e a) = s(ea)/s(e) = 4/6 0.667 descending order by confidence: b d, a d, e a, b c, e c b) interest factor: I(AB) = s(ab)/(s(a)s(b)) I<1 denotes the pattern occurs less often than expected from independent events I(b,c) = s(bc)/(s(b)s(c)) = 3/(7*5) 0.857 I(a,d) = s(ad)/(s(a)s(d)) = 4/(5*9) 0.889 I(b,d) = s(bd)/(s(b)s(d)) = 6/(7*9) 0.952 I(e,c) = s(ec)/(s(e)s(c) ) = 2/(6*5) 0.667 I(e,a) = s(ea)/(s(e)s(a)) = 4/(6*5) 1.333 descending order by interest factor e a, b d, a d, b c, e c

c) IS: IS(A,B) = s(ab)/sqrt(s(a)s(b)) IS(b,c) = s(bc)/sqrt(s(b)s(c)) = 3/sqrt(7*5) 0507 IS(a,d) = s(ad)/sqrt(s(a)s(d)) = 4/sqrt(5*9) 0.596 IS(b,d) = s(bd)/sqrt(s(b)s(d)) =6/sqrt(7*9) 0.756 IS(e,c) = s(ec)/sqrt(s(e)s(c) ) = 2/sqrt(6*5) 0.365 IS(e,a) = s(ea)/sqrt(s(e)s(a)) = 4/sqrt(6*5) 0.730 descending order by IS measure: b d, e a, a d, b c, e c Exercise 6 a) b c Taking first randomly permuted column (4 6 2 10 3 5 7 8 9 1) and swap values of c accordingly. New database: tid items 1 abcde 2 bd 3 abcde 4 ade 5 bde 6 bcde 7 cd 8 abc 9 ade 10 bd s(b) is the same = 0.7 s(bc) = 0.4 c(b c) = s(bc)/s(b) = 0.4/0.7 0.5714 Taking second randomly permuted column (8 1 10 4 2 9 6 5 3 7) and swap values of c accordingly. New database tid items 1 abcde 2 bd 3 abde 4 acde 5 bcde 6 bde 7 d 8 abc 9 ade 10 bcd

s(b) is the same = 0.7 s(bc) = 0.4 c(b c) = s(bc)/s(b) = 0.4/0.75 0.5714 Taking third randomly permuted column (2 6 8 7 10 3 9 4 5 1) and swap values of c accordingly. New database tid items 1 abcde 2 bd 3 abcde 4 acde 5 bde 6 bde 7 d 8 abc 9 acde 10 bd s(b) is the same = 0.7 s(bc) = 0.3 c(b c) = s(bc)/s(b) = 0.3/0.7 0.4286 Taking fourth randomly permuted column (5 9 4 2 3 8 7 10 6 1) and swap values of c accordingly. New database tid items 1 abcde 2 bd 3 abcde 4 acde 5 bde 6 bcde 7 cd 8 ab 9 ade 10 bd s(b) is the same = 0.7 s(bc) = 0.3 c(b c) = s(bc)/s(b) = 0.3/0.7 0.4286 confidences ordered in descending order 0.5714, 0.5714, 0.4286, 0.4286 confidence from original data c(b c) = 0.429

p-value: two of datasets has higher confidence for rule b c p = r/n = 2/4 = 0.5 Exercise 7 a) The original confidence from ex 5.a) : c(b c) 0.429 p = r/n Because the original confidence hits to the second slot we choose the worst possibility for us, which means that we get the biggest p-value. Therefore we take the count of three bins having higher or equal confidence and use the sum as r. r = 42+45+6 = 93 and p = 93/100 = 0.93 b) The original confidence from ex5.a): c(a d) 0.8 The original confidence hits the first slot. If we count the sum of all bins having equal or higher confidence we get p = r/n = 100/100 = 1 c) The original confidence from ex5.a): c(e c) 0.333 The original confidence hits the border of the first and second bin. If we count the three bins having equal or higher confidence, we get p = r/n = (24+47+26)/100 = 0.97 Based on these the rules are ranked based on p-value (worst case scenario): b c, e c, a d Note about p-value: p-value is measuring the statistical significance: what is the propability that our value is obtained randomly. Basic definition of p-value is that the p-value is the probability that a variable would assume a value greater than or equal to the observed value from random source. This means that the smaller p-values we get, more sure we can be that our value is not just accidentally "good" and therefore we want to get as small p-values as we can. To be sure that our evaluation is correct, we here try to give estimation of which we can be sure of and therefore we give the worst case scenario p-value (= every other observed confidence value in the bin our value hits is bigger than our value). This means that at least we re not lying or giving too good estimates. In reality it could well be that there s only one value higher than our confidence-value in the bin, but we cannot be sure of that and that s why we assume that all the values in the bin where our value hits are higher than our value.

Exercise 8 tid a b c d e 1 1 1 0 1 1 2 0 1 1 1 0 3 1 1 0 1 1 4 1 0 1 1 1 5 0 1 1 1 1 6 0 1 0 1 1 7 0 0 1 1 0 8 1 1 1 0 0 9 1 0 0 1 1 10 0 1 0 1 0 conditions for swapping: xi=yj=1 and xj=yi=0 swapping (row,col), (row,col) = (8,2), (4,5) tid a b c d e 8 1 1 1 0 0 4 1 0 1 1 1 condition fulfilled (8,2)=(4,5) = 1 and (8,5)=(4,2) = 0 tid a b c d e 8 1 0 1 0 1 4 1 1 1 1 0 swapping (row,col), (row, col) = (7,2), (1,5) tid a b c d e 7 0 0 1 1 0 1 1 1 0 1 1 condition not fulfilled since (7,2)!= (1,5) and (7,2) = 0 and (7,5)!= (1,2) and (1,2)=1 swapping (row,col), (row,col) = (2,4), (8,1) tid a b c d e 2 0 1 1 1 0 8 1 0 1 0 1 condition fulfilled (2,4)=(8,1) = 1 and (2,1)=(8,4) = 0 swapping tid a b c d e 2 1 1 1 0 0 8 0 0 1 1 1 swapping (row,col), (row, col) = (4,2), (3,4)

tid a b c d e 4 1 1 1 1 0 3 1 1 0 1 1 condition not fulfilled since (4,4) =(3,2)=1 swapping (row,col), (row,col) = (9,5), (2,2) tid a b c d e 9 1 0 0 1 1 2 1 1 1 1 0 condition fulfilled (9,5)=(2,2) = 1 and (9,2)=(2,5) = 0 swapping tid a b c d e 9 1 1 0 1 0 2 1 0 1 1 1 new transaction database tid a b c d e 1 1 1 0 1 1 2 1 0 1 1 1 3 1 1 0 1 1 4 1 1 1 1 0 5 0 1 1 1 1 6 0 1 0 1 1 7 0 0 1 1 0 8 0 0 1 1 1 9 1 1 0 1 0 10 0 1 0 1 0 tid items 1 abde 2 acde 3 abde 4 abcd 5 bcde 6 bde 7 cd 8 cde 9 abd 10 bd