CS4445 B10 Homework 4 Part I Solution


Yutao Wang

Consider the zoo.arff dataset, converted to ARFF from the Zoo Data Set available at the Univ. of California Irvine KDD Data Repository.

1. Load this dataset into Weka. Remove the 1st attribute (animal_name), which is a string. Go to "Associate" and run Apriori with "numRules 30", "outputItemSets True", "verbose True", and default values for the remaining parameters.

2. Now run Apriori with "numRules 30", "outputItemSets True", "verbose True", "treatZeroAsMissing True", and default values for the remaining parameters.

1. [5 points] What difference do you see between the rules obtained in Parts 1 and 2 above? Explain.

Part 2 does not generate rules involving the value 0, while Part 1 does. Since 0 is a more common (frequent) value than 1, Part 1 would generate a much larger number of rules (if Weka didn't limit the output to the first 30 rules generated), and these rules tend to contain many more occurrences of 0 values than of 1 values.

2. [5 points] From now on, consider just the second set of rules (that is, when "treatZeroAsMissing True"). Find an association rule you find interesting and explain it. Include the confidence and support values in your explanation of the rule.

For a rule X -> Y:
Confidence: c(X -> Y) = σ(X ∪ Y) / σ(X) = P(X and Y) / P(X)
Support: s(X -> Y) = σ(X ∪ Y) / N = P(X and Y)

Take rule #15 as an example:
hair=1 backbone=1 39 ==> milk=1 39    <conf:(1)> lift:(2.46) lev:(0.23) [23] conv:(23.17)

The support value is 39/101, and the confidence value is 39/39 = 1. This means that all animals in the dataset that have hair and a backbone produce/drink milk.

3. [10 points] What are "lift", "leverage", and "conviction"? Provide an explicit formula for each one of them (look at the Weka code to find those formulas). Use the values of these metrics for the association rule you chose in the previous part to judge how interesting/useful this rule is.

Lift(X -> Y) = c(X -> Y) / s(Y) = P(X and Y) / (P(X) * P(Y)) = P(Y|X) / P(Y)
Leverage(X -> Y) = P(X and Y) - P(X) * P(Y)
Conviction(X -> Y) = P(X) * P(~Y) / P(X and ~Y)

In the previous example, lift = 2.46 > 1 indicates that having hair and a backbone increases the probability of producing/drinking milk by a factor of 2.46. Leverage = 0.23 > 0. According to Weka, "Leverage is the proportion of additional examples covered by both the premise and consequence above those expected if the premise and consequence were independent of each other. The total number of examples that this represents is presented in brackets following the leverage." Conviction is another measure of departure from independence. Hence, in this example, the antecedent and the consequent of the rule cover 23 instances more than expected if they were independent. This provides additional evidence that the antecedent and the consequent of this rule are not independent of each other. The high value of conviction (23.17) reinforces this point as well.
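As a quick sanity check of these formulas, here is a small Python sketch (not part of the original assignment) that recomputes the metrics for rule #15 from raw counts. The counts N = 101, σ(hair=1, backbone=1) = 39 and σ(milk=1) = 41 are assumptions consistent with the support, lift and leverage reported by Weka; they are not printed in the rule output itself.

```python
# Recompute Weka's rule-quality metrics for:  hair=1 backbone=1 ==> milk=1
# Assumed counts (consistent with the lift/leverage reported above):
N = 101          # instances in zoo.arff (after removing animal_name)
n_X = 39         # sigma(hair=1, backbone=1)         -- rule premise
n_Y = 41         # sigma(milk=1)                     -- rule consequent (assumed)
n_XY = 39        # sigma(hair=1, backbone=1, milk=1)

p_X, p_Y, p_XY = n_X / N, n_Y / N, n_XY / N

support    = p_XY                        # s(X -> Y) = P(X and Y)
confidence = p_XY / p_X                  # c(X -> Y) = P(X and Y) / P(X)
lift       = confidence / p_Y            # P(Y|X) / P(Y)
leverage   = p_XY - p_X * p_Y            # P(X and Y) - P(X)P(Y)
# Conviction = P(X)P(~Y) / P(X and ~Y); infinite when confidence is 1.
p_X_notY   = p_X - p_XY
conviction = float('inf') if p_X_notY == 0 else p_X * (1 - p_Y) / p_X_notY

print(f"support    = {support:.3f}")     # ~0.386  (39/101)
print(f"confidence = {confidence:.3f}")  # 1.000
print(f"lift       = {lift:.2f}")        # 2.46
print(f"leverage   = {leverage:.2f}")    # 0.23  (~23 extra instances)
print(f"conviction = {conviction}")      # inf here; Weka prints a finite 23.17,
                                         # apparently from its own handling of the
                                         # zero-count boundary case
```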

4. Look at the itemsets generated. Let's consider in particular the generation of 5-itemsets from 4-itemsets:

Minimum support: 0.35 (35 instances)
...
Size of set of large itemsets L(4): 8
Large Itemsets L(4):
hair=1 milk=1 backbone=1 breathes=1 39
hair=1 toothed=1 backbone=1 breathes=1 38
milk=1 toothed=1 backbone=1 breathes=1 40
milk=1 backbone=1 breathes=1 tail=1 35
toothed=1 backbone=1 breathes=1 legs=4 35
toothed=1 backbone=1 breathes=1 tail=1 38
Size of set of large itemsets L(5): 1
Large Itemsets L(5):
hair=1 milk=1 toothed=1 backbone=1 breathes=1 38

1. [5 points] State what the "join" condition is (called "merge" in the F(k-1) x F(k-1) method in your textbook, p. 341). Show how the "join" condition was used to generate 5-itemsets from 4-itemsets. (Warning: not all candidate 5-itemsets are shown above.)

Join condition: merge two (k-1)-itemsets into a k-itemset if their first (k-2) items are identical.

Here, merge:
(1) hair=1 milk=1 toothed=1 backbone=1 and hair=1 milk=1 toothed=1 breathes=1 (the two frequent 4-itemsets not listed above; their first three items are identical)
    into hair=1 milk=1 toothed=1 backbone=1 breathes=1
(2) toothed=1 backbone=1 breathes=1 legs=4 (35) and toothed=1 backbone=1 breathes=1 tail=1 (38)
    into toothed=1 backbone=1 breathes=1 legs=4 tail=1
No other pair of frequent 4-itemsets satisfies the merge condition, so no more candidate 5-itemsets are generated.

2. [5 points] State what the "subset" condition is (called "candidate pruning" in the F(k-1) x F(k-1) method in your textbook, p. 341). Show how the "subset" condition was used to eliminate candidate 5-itemsets from consideration before unnecessarily counting their support.

Subset condition: check whether all the (k-1)-item subsets of the resulting k-itemset are frequent (that is, have enough support). If at least one of these subsets is not frequent, then the k-itemset cannot be frequent (by the Apriori principle), and it is pruned without counting its support.
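The two steps just described (merge, then subset-based pruning) can be written compactly. The following Python sketch is illustrative only (it is not Weka's code); the item_order parameter is my own addition so the sketch can canonicalize itemsets the same way Weka orders attributes.

```python
from itertools import combinations

def apriori_gen(frequent_kminus1, item_order):
    """F(k-1) x F(k-1) candidate generation with subset-based pruning.

    frequent_kminus1 -- collection of frequent (k-1)-itemsets, each a tuple of
                        item strings such as 'hair=1'
    item_order       -- list fixing the total order on items (for the zoo data,
                        the attribute order of the ARFF file)
    """
    rank = {item: i for i, item in enumerate(item_order)}
    # Canonicalize every itemset according to the fixed item order.
    frequent = {tuple(sorted(s, key=rank.__getitem__)) for s in frequent_kminus1}
    ordered = sorted(frequent, key=lambda s: [rank[i] for i in s])
    candidates = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            # Join ("merge") step: the first k-2 items must be identical.
            if a[:-1] == b[:-1]:
                candidate = a + (b[-1],)
                # Prune ("subset") step: every (k-1)-subset of the candidate
                # must itself be frequent, by the Apriori principle.
                if all(sub in frequent
                       for sub in combinations(candidate, len(candidate) - 1)):
                    candidates.append(candidate)
    return candidates
```

Called with only the six L(4) itemsets listed above (and item_order set to the zoo attribute order), the join step pairs only toothed=1 backbone=1 breathes=1 legs=4 with toothed=1 backbone=1 breathes=1 tail=1, and the resulting candidate is discarded by the prune step because its subset backbone=1 breathes=1 legs=4 tail=1 is not frequent. With the full L(4) (all eight itemsets), the candidate hair=1 milk=1 toothed=1 backbone=1 breathes=1 is generated as well and survives the pruning.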

For these two candidate 5-itemsets:

hair=1 milk=1 toothed=1 backbone=1 breathes=1
All 4-item subsets of this itemset are frequent (for instance hair=1 milk=1 backbone=1 breathes=1 (39) and hair=1 toothed=1 backbone=1 breathes=1 (38)), so this itemset is a candidate frequent itemset whose support will need to be calculated by scanning the dataset to determine whether or not it is indeed frequent.

toothed=1 backbone=1 breathes=1 legs=4 tail=1
Note that backbone=1 breathes=1 legs=4 tail=1 does not appear on the list of frequent 4-itemsets, and therefore it is not frequent. Hence this 5-itemset cannot be frequent, and it is eliminated without counting its support.

5. [10 points] Consider the following frequent 4-itemset: milk=1 backbone=1 breathes=1 tail=1. Use Algorithms 6.2 and 6.3 (pp. 351-352), which are based on Theorem 6.2, to construct all rules with confidence 100% from this 4-itemset. Show your work by neatly constructing a lattice similar to the one depicted in Figure 6.15 (but you don't need to expand/include pruned rules).

Let's use the following notation: m = milk=1, a = backbone=1, r = breathes=1, t = tail=1.

Rules with 1-item consequents:
conf(art -> m) = 35/60 = 0.5833 < 1, so prune all rules containing item m in their consequent.
conf(mrt -> a) = 35/35 = 1
conf(mat -> r) = 35/35 = 1
conf(mar -> t) = 35/41 = 0.8537 < 1, so prune all rules containing item t in their consequent.

Rules with 2-item consequents:
conf(rt -> ma): pruned
conf(at -> mr): pruned
conf(ar -> mt): pruned
conf(mt -> ar) = 35/35 = 1
conf(mr -> at): pruned
conf(ma -> rt): pruned

Rules with 3-item consequents:
conf(t -> mar): pruned
conf(r -> mat): pruned
conf(a -> mrt): pruned
conf(m -> art): pruned

The final rules after pruning are:
mart -> {}
mrt -> a
mat -> r
mt -> ar
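To make the pruning concrete, here is a small Python sketch of confidence-based rule generation from a single frequent itemset. It is illustrative only (not the textbook's Algorithms 6.2/6.3 verbatim, nor Weka's code): it considers consequents of increasing size and, following Theorem 6.2, skips any consequent that contains an already-failed consequent. The support counts are exactly the ones quoted above; counts never reached after pruning are not needed.

```python
from itertools import combinations

# Support counts quoted above (m = milk=1, a = backbone=1, r = breathes=1,
# t = tail=1).  Only antecedents actually reached after pruning appear here.
support = {
    frozenset('mart'): 35,
    frozenset('art'): 60, frozenset('mrt'): 35,
    frozenset('mat'): 35, frozenset('mar'): 41,
    frozenset('mt'): 35,
}

def rules_from_itemset(itemset, minconf=1.0):
    """Generate rules X -> Y with conf >= minconf from one frequent itemset,
    pruning every consequent that contains an already-failed consequent
    (Theorem 6.2)."""
    itemset = frozenset(itemset)
    failed, rules = [], []
    for size in range(1, len(itemset)):                # consequent size 1..k-1
        for consequent in map(frozenset, combinations(sorted(itemset), size)):
            if any(f <= consequent for f in failed):   # superset of a failed consequent
                continue
            antecedent = itemset - consequent
            conf = support[itemset] / support[antecedent]
            if conf >= minconf:
                rules.append((antecedent, consequent, conf))
            else:
                failed.append(consequent)
    return rules

def label(s):
    return ''.join(sorted(s, key='mart'.index))

for ante, cons, conf in rules_from_itemset('mart'):
    print(label(ante), '->', label(cons), f'(conf = {conf:.0%})')
# Prints:  mrt -> a (conf = 100%),  mat -> r (conf = 100%),  mt -> ar (conf = 100%)
```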

3. [5 points] Explain how the process of mining association rules in Weka's Apriori is performed in terms of the following parameters: lowerBoundMinSupport, upperBoundMinSupport, delta, metricType, minMetric, numRules.

Since it might be difficult for a user to figure out a good value for minimum support so that a suitable number of association rules is produced (not too many or too few), Weka lets the user state the number of association rules desired (numRules) instead of a single minimum support value. It then follows a process like this:

min. support = upperBoundMinSupport
repeat
    mine association rules with this min. support and the minMetric threshold on the chosen metricType (say, confidence)
    when/if at least numRules rules are generated, return them and exit
    if not enough rules were generated, decrease the min. support: min. support = min. support - delta
until min. support < lowerBoundMinSupport

4. [10 points] Exercise 16, p. 411 of the textbook. The measure is M(A -> B) = (P(B|A) - P(B)) / (1 - P(B)).

(a) Range: (-∞, 1]. Here is why: M <= 1 since P(B|A) <= 1, and M = 1 exactly when P(B|A) = 1. On the other hand, when P(B|A) = 0, M = -P(B)/(1 - P(B)); in the limit as P(B) goes to 1, M goes to -∞.

(b) M = (P(A,B)/P(A) - P(B)) / (1 - P(B)). So when P(A,B) is increased while P(A) and P(B) remain unchanged, M will increase.

(c) As shown in the previous formula, when P(A) is increased while P(A,B) and P(B) remain unchanged, P(B|A) = P(A,B)/P(A) decreases, so M will decrease.

(d) M can be rewritten as M = 1 - (1 - P(B|A))/(1 - P(B)). So when P(B) is increased while P(A,B) and P(A) remain unchanged, 1 - P(B) decreases, the subtracted term grows, and M will decrease (it stays at 1 only in the special case P(B|A) = 1).

(e) M(A -> B) = (P(B|A) - P(B))/(1 - P(B)), while M(B -> A) = (P(A|B) - P(A))/(1 - P(A)). These are not equal in general, so M is not symmetric.

(f) When A and B are independent, P(B|A) = P(B), so M = 0.

(g) No, it is not null-invariant. Writing M in terms of the contingency-table counts (with f00 the number of null transactions, i.e., transactions containing neither A nor B, and N the total number of transactions): P(B|A) does not depend on f00, but P(B) = (f11 + f01)/N does. When f00 increases, N increases while all the other counts on the right-hand side of the equation remain the same, so P(B) decreases and M increases. Since M changes with the number of null transactions, it is not null-invariant.

(h) No, M is not invariant under row/column scaling, since scaling the rows or columns of the contingency table changes P(B|A) and P(B).
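A quick numeric check of parts (e), (f), and (g) is sketched below; the 2x2 contingency counts are made up for illustration and are not from the zoo data.

```python
# Exercise 16 measure:  M(A -> B) = (P(B|A) - P(B)) / (1 - P(B))
def M(f11, f10, f01, f00):
    """M(A -> B) from the 2x2 contingency counts of A and B."""
    n = f11 + f10 + f01 + f00
    p_b_given_a = f11 / (f11 + f10)
    p_b = (f11 + f01) / n
    return (p_b_given_a - p_b) / (1 - p_b)

# (f) independence: P(A,B) = P(A) P(B)  ->  M = 0
print(M(20, 20, 20, 20))                          # 0.0
# (e) not symmetric: transposing the table (swapping f10 and f01) changes M
print(M(30, 10, 20, 40), M(30, 20, 10, 40))       # 0.5  vs  0.333...
# (g) not null-invariant: adding null transactions (raising f00) changes M
print(M(30, 10, 20, 40), M(30, 10, 20, 4000))     # 0.5  vs  ~0.747
```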

(i) Under inversion, A becomes ~A and B becomes ~B (where ~A is the complement of A), so:

M(~A -> ~B) = (P(~B|~A) - P(~B)) / (1 - P(~B))
            = (P(~A,~B)/P(~A) - P(~B)) / P(B)
            = (P(A,B) - P(A)P(B)) / (P(B)(1 - P(A)))    [since P(~A,~B) - P(~A)P(~B) = P(A,B) - P(A)P(B)]
            = (P(A|B) - P(A)) / (1 - P(A))
            = M(B -> A)

So after inversion the measure M(A -> B) becomes M(B -> A), which in general differs from M(A -> B) since M is not symmetric (part (e)). So M is not invariant under the inversion operation.
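This identity can also be checked numerically on the same made-up counts as above (the helper is repeated so the snippet runs on its own):

```python
def M(f11, f10, f01, f00):
    # M(A -> B) = (P(B|A) - P(B)) / (1 - P(B)), same helper as above
    n = f11 + f10 + f01 + f00
    p_b_given_a = f11 / (f11 + f10)
    p_b = (f11 + f01) / n
    return (p_b_given_a - p_b) / (1 - p_b)

f11, f10, f01, f00 = 30, 10, 20, 40   # made-up counts for A and B
# Inversion replaces A by ~A and B by ~B, so f00 takes the role of f11,
# f01 the role of f10, and so on.
print(M(f00, f01, f10, f11))   # M(~A -> ~B) = 0.333...
print(M(f11, f01, f10, f00))   # M(B -> A)   = 0.333...  (same value)
print(M(f11, f10, f01, f00))   # M(A -> B)   = 0.5       (different)
```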