Assignment 7 (Sol.)
Introduction to Data Analytics
Prof. Nandan Sudarsanam & Prof. B. Ravindran

1. Let X, Y be two itemsets, and let supp(X) denote the support of itemset X. Then the confidence of the rule X → Y, denoted by conf(X → Y), is

Sol.
conf(X → Y) = supp(X ∪ Y) / supp(X)

Confidence measures the probability of seeing the items in the consequent (RHS) of the rule given that we have observed the items in the antecedent (LHS) of the rule in a transaction.

2. In identifying frequent itemsets in a transactional database, we find the following to be the frequent 3-itemsets: {B, D, E}, {C, E, F}, {B, C, D}, {A, B, E}, {D, E, F}, {A, C, F}, {A, C, E}, {A, B, C}, {A, C, D}, {C, D, E}, {C, D, F}, {A, D, E}. Which among the following 4-itemsets can possibly be frequent?

(a) {A, B, C, D}
(b) {A, B, D, E}
(c) {A, C, E, F}
(d) {C, D, E, F}

Sol. (d)
By the apriori property, only the itemset {C, D, E, F} can possibly be frequent, since all of its subsets of size 3 are listed as frequent. The other 4-itemsets cannot be frequent since not all of their subsets of size 3 are frequent. For example, for the first option, the subset {A, B, D} is not listed as frequent.

3. Let X, Y be two itemsets, let supp(X) denote the support of itemset X, and let conf(X → Y) denote the confidence of the rule X → Y. Then the lift of the rule, denoted by lift(X → Y), is

Sol. (d)
lift(X → Y) = supp(X ∪ Y) / (supp(X) · supp(Y)) = conf(X → Y) / supp(Y)

The lift of a rule can be thought of as the ratio of the observed support to the support that would be expected if the antecedent and consequent were independent.
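The definitions in questions 1 and 3 can be checked numerically. Below is a minimal Python sketch that computes support, confidence, and lift for a rule over a small transaction list; the transactions and the helper names (supp, conf, lift) are illustrative, not taken from the assignment.

```python
# Minimal sketch of support, confidence and lift (illustrative data, not from the assignment).
transactions = [
    {"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"},
    {"milk"}, {"bread", "milk"},
]

def supp(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(X, Y):
    """conf(X -> Y) = supp(X u Y) / supp(X)."""
    return supp(X | Y) / supp(X)

def lift(X, Y):
    """lift(X -> Y) = supp(X u Y) / (supp(X) * supp(Y)) = conf(X -> Y) / supp(Y)."""
    return supp(X | Y) / (supp(X) * supp(Y))

X, Y = {"bread"}, {"milk"}
print(supp(X | Y))   # 0.6    (3 of 5 transactions contain both items)
print(conf(X, Y))    # 0.75   (3 of the 4 transactions with bread also contain milk)
print(lift(X, Y))    # 0.9375 = 0.6 / (0.8 * 0.8)
```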

4. Consider the following transactional data.

Transaction ID   Items
1                A, B, E
2                B, D
3                B, C
4                A, B, D
5                A, C
6                B, C
7                A, C
8                A, B, C, E
9                A, B, C

Assuming that the minimum support is 2, what is the number of frequent 2-itemsets (i.e., frequent itemsets of size 2)?

(a) 2
(b) 4
(c) 6
(d) 8

Sol. (c)

Candidate 1-itemsets: {A} 6, {B} 7, {C} 6, {D} 2, {E} 2
Frequent 1-itemsets: {A} 6, {B} 7, {C} 6, {D} 2, {E} 2
Candidate 2-itemsets: {A, B} 4, {A, C} 4, {A, D} 1, {A, E} 2, {B, C} 4, {B, D} 2, {B, E} 2, {C, D} 0, {C, E} 1, {D, E} 0
Frequent 2-itemsets: {A, B} 4, {A, C} 4, {A, E} 2, {B, C} 4, {B, D} 2, {B, E} 2

Thus there are 6 frequent 2-itemsets.

5. For the same data as above, what are the numbers of candidate 3-itemsets and frequent 3-itemsets respectively?

(a) 1, 1
(b) 2, 2
(c) 2, 1
(d) 3, 2

Sol. (b)

Candidate 3-itemsets: {A, B, C} 2, {A, B, E} 2
Frequent 3-itemsets: {A, B, C} 2, {A, B, E} 2
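The counts in the two solutions above can be reproduced programmatically. The following is a minimal Python sketch of the level-wise Apriori steps on this dataset, using absolute support counts as in the solution; the function and variable names (count_supports, freq1, cand3, etc.) are illustrative, not part of the assignment. The prune step is exactly the apriori property used in question 2.

```python
from itertools import combinations

# Transactions from question 4; the minimum (absolute) support is 2.
transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]
min_support = 2

def count_supports(candidates):
    """Absolute support of each candidate itemset (given as a sorted tuple)."""
    return {c: sum(set(c) <= t for t in transactions) for c in candidates}

# Level 1: every single item is a candidate.
items = sorted(set().union(*transactions))
freq1 = {c: s for c, s in count_supports([(i,) for i in items]).items() if s >= min_support}

# Level 2: candidate 2-itemsets are pairs of frequent items.
cand2 = [tuple(sorted(p)) for p in combinations([c[0] for c in freq1], 2)]
freq2 = {c: s for c, s in count_supports(cand2).items() if s >= min_support}
print(len(freq2), sorted(freq2))   # 6 frequent 2-itemsets -> option (c)

# Level 3: join frequent 2-itemsets sharing their first item, then prune
# any candidate that has an infrequent 2-subset (the apriori property).
cand3 = set()
for a, b in combinations(sorted(freq2), 2):
    if a[0] == b[0]:
        cand = tuple(sorted(set(a) | set(b)))
        if all(sub in freq2 for sub in combinations(cand, 2)):
            cand3.add(cand)
freq3 = {c: s for c, s in count_supports(sorted(cand3)).items() if s >= min_support}
print(sorted(cand3), sorted(freq3.items()))   # 2 candidates, both frequent -> option (b)
```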

6. Continuing with the same data, how many association rules can be derived from the frequent itemset {A, B, E}? (Note: for a frequent itemset X, consider only rules of the form S → (X − S), where S is a non-empty proper subset of X.)

(a) 3
(b) 6
(c) 7
(d) 8

Sol. (b)
{A} → {B, E}
{B} → {A, E}
{E} → {A, B}
{A, B} → {E}
{A, E} → {B}
{B, E} → {A}

7. For the same frequent itemset as mentioned above, which among the following rules have a minimum confidence of 60%?

(a) {A, B} → {E}
(b) {A, E} → {B}
(c) {E} → {A, B}
(d) {A} → {B, E}

Sol. (b), (c)
The confidence values for the above four rules are, respectively, 2/4, 2/2, 2/2, and 2/6. Hence, only the rules in (b) and (c) have the minimum required confidence.
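The rules of questions 6 and 7 and their confidences can be enumerated directly from the same transactions. A short Python sketch (names are again illustrative):

```python
from itertools import combinations

transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]

def supp(itemset):
    """Absolute support: number of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions)

X = {"A", "B", "E"}
rules = []
# Enumerate S -> (X - S) for every non-empty proper subset S of X.
for size in range(1, len(X)):
    for S in combinations(sorted(X), size):
        S = set(S)
        rules.append((S, X - S))

print(len(rules))  # 6 rules -> option (b) in question 6
for S, Y in rules:
    print(sorted(S), "->", sorted(Y), f"confidence = {supp(X)}/{supp(S)}")
# {A}->{B,E}: 2/6, {B}->{A,E}: 2/7, {E}->{A,B}: 2/2,
# {A,B}->{E}: 2/4, {A,E}->{B}: 2/2, {B,E}->{A}: 2/2
```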

8. Suppose we are given a large text document and the aim is to count the words of different lengths, i.e., our output will be of the form: x words of length 1, y words of length 2, and so on. Assuming a map-reduce approach to solving this problem, which among the following key-value outputs would you prefer for the map phase? (Hint: consider the solution for the reduce part asked in the next question as well, to ensure a complete algorithm to solve the problem.)

(a) key - word, value - length (of corresponding word)
(b) key - word, value - 1
(c) key - length (of corresponding word), value - word
(d) key - 1, value - word

Sol. (c)
Given a word, in the map phase we create a key-value pair where the key is the length of the word and the value is the word itself.

9. For the above question, what would be the appropriate processing action in the reduce phase?

(a) for each key which is a word, compute the sum of the values corresponding to this key
(b) for each key which is a number, compute the lengths of the words in the corresponding list of values
(c) for each key which is a number, count the number of words in the corresponding list of values

Sol. (c)
In the reduce phase, all words of the same length will be available at the same reduce node. Thus, in each reduce node, counting the number of words in the list of values gives us the number of words of a particular length observed in the original document.
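The map and reduce phases described in questions 8 and 9 can be simulated in a few lines of plain Python; the grouping step stands in for the framework's shuffle, and the document text and variable names are illustrative only.

```python
from collections import defaultdict

document = "the quick brown fox jumps over the lazy dog"

# Map phase: for every word, emit (key = word length, value = the word itself).
mapped = [(len(word), word) for word in document.split()]

# Shuffle: group values by key, as the framework would between the two phases.
groups = defaultdict(list)
for length, word in mapped:
    groups[length].append(word)

# Reduce phase: for each key (a length), count the words in its list of values.
counts = {length: len(words) for length, words in groups.items()}
print(sorted(counts.items()))  # [(3, 4), (4, 2), (5, 3)]
```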

10. Let d1 and d2 be two distances according to some distance measure d. A function f is said to be (d1, d2, p1, p2)-sensitive if

(a) if d(x, y) ≤ d1, then the probability that f(x) = f(y) is at least p1
(b) if d(x, y) ≥ d2, then the probability that f(x) = f(y) is at most p2

where d(·, ·) is a distance function. Given such a (d1, d2, p1, p2)-sensitive function, a better function (for use in locality sensitive hashing) would be one with

(a) an increased value of p1
(b) a decreased value of p1
(c) an increased value of p2
(d) a decreased value of p2

Sol. (a), (d)
Compared to the original function, if we can increase the value of p1, we can ensure that if two points are close enough (d(x, y) ≤ d1), then the probability of a collision is higher. This is a desirable property when performing locality sensitive hashing. Similarly, if p2 can be reduced, it would indicate that given some separation between two points (d(x, y) ≥ d2), the probability of still observing a collision (which is undesirable) is reduced.
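As a concrete illustration of a (d1, d2, p1, p2)-sensitive family (not part of the assignment), consider bit-sampling for Hamming distance: hashing a binary vector of length n to one randomly chosen coordinate gives Pr[f(x) = f(y)] = 1 − d(x, y)/n, so the family is (d1, d2, 1 − d1/n, 1 − d2/n)-sensitive. The Python sketch below estimates these collision probabilities empirically; all names and parameter values are illustrative.

```python
import random

n = 20                      # length of the binary vectors
random.seed(0)

def flip_bits(x, k):
    """Return a copy of x with exactly k randomly chosen bits flipped (Hamming distance k)."""
    y = list(x)
    for i in random.sample(range(n), k):
        y[i] ^= 1
    return y

def collision_rate(dist, trials=10000):
    """Empirical Pr[f(x) = f(y)] when f samples one random coordinate and d(x, y) = dist."""
    hits = 0
    for _ in range(trials):
        x = [random.randint(0, 1) for _ in range(n)]
        y = flip_bits(x, dist)
        i = random.randrange(n)       # the hash function: one sampled coordinate
        hits += (x[i] == y[i])
    return hits / trials

d1, d2 = 2, 10
print(collision_rate(d1))  # close to 1 - d1/n = 0.9 -> this plays the role of p1
print(collision_rate(d2))  # close to 1 - d2/n = 0.5 -> this plays the role of p2
```

A larger gap between the estimated p1 and p2 (higher p1, lower p2) is exactly what makes a family better for locality sensitive hashing, as in options (a) and (d) above.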