
Università di Pisa - A.A. 2016-2017 - Data Mining II - June 13th, 2017

Exercise 1 - Sequential patterns (6 points)

a) (3 points) Given the following input sequence

   < {A}  {B,F}  {E}  {A,B}  {A,C,D}  {F}  {B,E}  {C,D} >
     t=0  t=1    t=2  t=3    t=4      t=5  t=6    t=7

show all the occurrences (there can be more than one, or none, in general) of each of the following subsequences in the input sequence above. Repeat the exercise twice: the first time considering no temporal constraints (left column); the second time considering min-gap = 1, i.e. gap > 1 (right column). Each occurrence should be represented by its corresponding list of time stamps, e.g.: <0,2,3> = <t=0, t=2, t=3>.

Solution:

                         Occurrences                        Occurrences with min-gap = 1
  ex.: <{B}{E}>          <1,2> <1,6> <3,6>                  <1,6> <3,6>
  w1 = <{A}{B}{E}>       <0,1,2> <0,1,6> <0,3,6>            <0,3,6>
  w2 = <{B}{D}>          <1,4> <1,7> <3,4> <3,7> <6,7>      <1,4> <1,7> <3,7>
  w3 = <{F}{E}{C,D}>     <1,2,4> <1,2,7> <1,6,7> <5,6,7>    none

b) (3 points) Running the GSP algorithm on a dataset of sequences, at the end of the third iteration it found the frequent 3-sequences listed below on the left, and at the next iteration it generated (among others) the candidate 4-sequences on the right. Which of the candidates will be pruned, and why?
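The occurrence lists of point a) can be reproduced programmatically. A minimal sketch, with the sequence hard-coded from the exercise; min-gap = g is interpreted as "consecutive timestamps must differ by more than g", matching the worked example:

```python
# Sketch: enumerate occurrences of a subsequence pattern in a timestamped
# sequence of itemsets, optionally with a min-gap constraint (consecutive
# timestamps must differ by more than min_gap).
def occurrences(sequence, pattern, min_gap=0):
    results = []

    def extend(prefix):
        if len(prefix) == len(pattern):
            results.append(tuple(prefix))
            return
        element = pattern[len(prefix)]
        start = prefix[-1] + min_gap + 1 if prefix else 0
        for t in range(start, len(sequence)):
            if element <= sequence[t]:        # itemset containment
                extend(prefix + [t])

    extend([])
    return results

seq = [{'A'}, {'B', 'F'}, {'E'}, {'A', 'B'}, {'A', 'C', 'D'},
       {'F'}, {'B', 'E'}, {'C', 'D'}]

print(occurrences(seq, [{'B'}, {'E'}]))             # <1,2> <1,6> <3,6>
print(occurrences(seq, [{'B'}, {'E'}], min_gap=1))  # <1,6> <3,6>
```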

Frequent 3-sequences:
  {A,B}{C}    {A,B}{D}    {A}{C,D}    {B}{C,D}
  {A}{C}{C}   {A}{C}{D}   {A}{D}{C}   {B}{C}{C}
  {B}{C}{D}   {B}{D}{C}   {D}{C}{C}   {D}{C}{D}

Candidate 4-sequences:
  1. {A,B}{C,D}
  2. {A}{D}{C}{D}
  3. {B}{D}{C}{D}
  4. {A,B}{D}{C}
  5. {A,B}{C}{D}

Solution: candidates 2 and 3 are PRUNED. Dropping one item from candidate 2 can yield {A}{D}{D}, and dropping one item from candidate 3 can yield {B}{D}{D}; neither is a frequent 3-sequence. Every 3-subsequence of candidates 1, 4 and 5 appears in the frequent set, so those candidates are kept.

Exercise 2 - Time series / Distances (6 points)

Given the following two time series [shown only as a figure in the original exam], compute (i) their Euclidean distance, (ii) their DTW distance, and (iii) their DTW distance with a Sakoe-Chiba band of size r=1 (i.e. all cells at distance <= 1 from the diagonal are allowed). For points (ii) and (iii) show the cost matrix and the optimal path found.

Solution:
  Euclidean: sqrt(74.0) = 8.60
  DTW: 14
  DTW with r=1: 17
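Returning to the pruning step of Exercise 1b: a candidate k-sequence survives only if every subsequence obtained by dropping a single item is still frequent. A small Python sketch of that check, with sequences encoded as tuples of frozensets:

```python
# Sketch of GSP candidate pruning: drop each single item in turn and check
# that the resulting (k-1)-sequence is frequent.
def seq(*itemsets):
    return tuple(frozenset(s) for s in itemsets)   # 'AB' -> {A, B}

def drop_one_item(s):
    for i, itemset in enumerate(s):
        for item in itemset:
            reduced = itemset - {item}
            yield s[:i] + ((reduced,) if reduced else ()) + s[i + 1:]

def is_pruned(candidate, frequent):
    return any(sub not in frequent for sub in drop_one_item(candidate))

F3 = {seq('AB', 'C'), seq('AB', 'D'), seq('A', 'CD'), seq('B', 'CD'),
      seq('A', 'C', 'C'), seq('A', 'C', 'D'), seq('A', 'D', 'C'),
      seq('B', 'C', 'C'), seq('B', 'C', 'D'), seq('B', 'D', 'C'),
      seq('D', 'C', 'C'), seq('D', 'C', 'D')}

for label, cand in [('1', seq('AB', 'CD')), ('2', seq('A', 'D', 'C', 'D')),
                    ('3', seq('B', 'D', 'C', 'D')), ('4', seq('AB', 'D', 'C')),
                    ('5', seq('AB', 'C', 'D'))]:
    print(label, 'PRUNED' if is_pruned(cand, F3) else 'kept')
```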

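For Exercise 2, since the two series appear only as a figure, the arrays below are placeholders rather than the exam data; the function itself implements the DTW recurrence, with an optional Sakoe-Chiba band:

```python
import math

# DTW via dynamic programming; cell cost |x[i] - y[j]|, optional Sakoe-Chiba
# band (only cells with |i - j| <= band are allowed). band=None: no constraint.
def dtw(x, y, band=None):
    n, m = len(x), len(y)
    INF = float('inf')
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if band is not None and abs(i - j) > band:
                continue
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x, y = [1, 2, 3, 4], [1, 3, 3, 5]   # placeholder series, not the exam's
print(euclidean(x, y), dtw(x, y), dtw(x, y, band=1))
```

A band can only remove warping options, so the banded DTW is always greater than or equal to the unconstrained one, consistent with the 17 vs. 14 in the solution.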
Exercise 3 - Analysis process & CRISP-DM (4 points)

A large telecom operator has realized that in recent years it has been losing several of its long-standing (more than 10 years) customers to competitors. In order to fight this phenomenon, the company needs both to identify potential churners and to understand what makes (or will make) them leave. Notice that at this stage the operator is not interested in recent customers: they will be handled separately, with different initiatives. Briefly describe a project plan to help the company, (loosely) following the CRISP-DM methodology. Clearly remark on the choices and assumptions made in the process. The operator has the following information:
- All demographic information on its customers
- The details of all their transactions (calls, data traffic usage, etc.) and contracts (including type of contract, start date and termination date)

Solution - key steps: select the long-standing customers; label each customer as churner vs. non-churner; extract features describing customer activity (e.g. frequency of calls, data traffic usage rates) and contracts (most frequent type? variability?). Then build a classification model, preferably a decision tree for readability reasons.

Exercise 4 - Classification (6 points)

a) Naive Bayes (3 points) Given the training set below [shown as a table in the original exam], build a Naive Bayes classification model (i.e. the corresponding table of probabilities) using (i) the standard frequency formula and (ii) the Laplace formula. What are the main effects of Laplace smoothing on the model?
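Since the exam's training table is only given as a figure, the tiny dataset below is a placeholder; the sketch shows the two estimators side by side, which is enough to see the main effect of Laplace smoothing: zero frequencies become small positive probabilities, so a single unseen value no longer zeroes out the whole product.

```python
from collections import Counter, defaultdict

# Sketch: per-class conditional probability tables for a categorical Naive
# Bayes model, with and without Laplace smoothing.
def nb_probabilities(rows, laplace=False):
    """rows: list of (feature_dict, class_label) pairs.
    Returns a dict mapping (class, attribute, value) -> P(value | class)."""
    values = defaultdict(set)            # attribute -> set of seen values
    counts = defaultdict(Counter)        # (class, attribute) -> value counts
    class_counts = Counter()
    for features, label in rows:
        class_counts[label] += 1
        for attr, val in features.items():
            values[attr].add(val)
            counts[(label, attr)][val] += 1
    probs = {}
    for (label, attr), cnt in counts.items():
        v = len(values[attr])            # number of distinct values of attr
        for val in values[attr]:
            if laplace:
                probs[(label, attr, val)] = (cnt[val] + 1) / (class_counts[label] + v)
            else:
                probs[(label, attr, val)] = cnt[val] / class_counts[label]
    return probs

rows = [({'outlook': 'sunny'}, 'yes'),   # placeholder training set
        ({'outlook': 'rain'}, 'yes'),
        ({'outlook': 'rain'}, 'no')]
plain = nb_probabilities(rows)
smooth = nb_probabilities(rows, laplace=True)
print(plain[('no', 'outlook', 'sunny')])    # 0.0: never seen -> hard zero
print(smooth[('no', 'outlook', 'sunny')])   # 1/3: Laplace avoids the zero
```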

b) k-NN (3 points) Given the training set on the right [shown as a figure in the original exam], composed of elements numbered from 1 to 12 and labelled as circles and diamonds, use it to classify the remaining 3 elements (letters A, B and C) with a k-NN classifier, k=3. For each point to classify, list the points of the dataset that belong to its k-NN set. Notice: A, B and C belong to the test set, not to the training set. The Euclidean distance should be used.

Solution:
  kNN(A) = {3, 5, 12}     -> CIRCLE
  kNN(B) = {3, 5, 7, 10}  -> CIRCLE  (four points are listed because of a distance tie)
  kNN(C) = {4, 6, 7}      -> CIRCLE

Exercise 5 - Outlier Detection (6 points)

Given the dataset of 10 points below [shown as a figure in the original exam], consider the outlier detection problem for points A and B, adopting the following three methods:

a) Distance-based: DB(ε,π) (2 points) Are A and/or B outliers, if the thresholds are set to ε = 2.1 and π = 0.15? The point itself should not be counted.

b) Density-based: LOF (2 points) Compute the LOF score for points A and B taking k=2, i.e. comparing each point with its 2 NNs (not counting the point itself). To simplify the calculations, the reachability distance used by LOF can be replaced by the plain Euclidean distance.

c) Depth-based (2 points) Compute the depth score of all points. Are A and/or B outliers of depth 1?
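For part a), the DB(ε,π) definition translates directly into code. The coordinates are only given in the exam's figure, so the points below are placeholders chosen to illustrate the mechanics:

```python
import math

# Sketch of the DB(eps, pi) test: p is an outlier iff the fraction of the
# *other* points lying within distance eps of p is at most pi.
def db_outlier(p, points, eps, pi):
    others = [q for q in points if q != p]
    near = sum(1 for q in others if math.dist(p, q) <= eps)
    return near / len(others) <= pi

pts = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]    # placeholder dataset
print(db_outlier((10, 10), pts, eps=2.1, pi=0.15))  # True: isolated point
print(db_outlier((0, 0), pts, eps=2.1, pi=0.15))    # False: dense cluster
```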

Solution (b):
  LRD(A) = 1 / [ (2 + √5)/2 ]       = 0.472
  LRD(5) = 1 / [ (√5 + √5)/2 ]      = 0.447
  LRD(6) = 1 / [ (2 + √5)/2 ]       = 0.472
  LOF(A) = ( [ LRD(5) + LRD(6) ] / 2 ) / LRD(A) = [ (0.447 + 0.472)/2 ] / 0.472 = 0.973

  LRD(B) = 1 / [ (2 + √2)/2 ]       = 0.586
  LRD(3) = 1 / [ (√2 + √2 + √2)/3 ] = 0.707
  LRD(4) = 1 / [ (√2 + 2 + 2)/3 ]   = 0.554
  LOF(B) = ( [ LRD(3) + LRD(4) ] / 2 ) / LRD(B) = [ (0.707 + 0.554)/2 ] / 0.586 = 1.076
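The numbers above can be checked with a short script; the neighbour distances (2, √2, √5) are taken from the worked solution, using its simplified reachability distance:

```python
import math

# Simplified LOF of Exercise 5b: LRD = inverse of the mean distance to the
# k nearest neighbours; LOF = mean LRD of the neighbours / LRD of the point.
def lrd(knn_dists):
    return len(knn_dists) / sum(knn_dists)

def lof(point_lrd, neighbour_lrds):
    return (sum(neighbour_lrds) / len(neighbour_lrds)) / point_lrd

r2, r5 = math.sqrt(2), math.sqrt(5)
lof_A = lof(lrd([2, r5]), [lrd([r5, r5]), lrd([2, r5])])
lof_B = lof(lrd([2, r2]), [lrd([r2, r2, r2]), lrd([r2, 2, 2])])
print(round(lof_A, 3), round(lof_B, 3))
```

A score below 1 (point A) means the point is at least as dense as its neighbours, while a score above 1 (point B) flags lower density than the neighbourhood.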

Exercise 6 - Validation (3 points)

Lift chart (3 points) Given the decision tree on the left [shown as a figure in the original exam], where the leaves also show the confidence of each prediction, and given the test set on the right, build the corresponding lift chart. Show the process you followed.
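The construction can be sketched as follows: score each test record with the confidence of the leaf it falls into, sort the records by score descending, and accumulate the positives; the curve is compared against the diagonal of a random ordering. The (confidence, true label) pairs below are placeholders, since the tree and test set are only given as figures:

```python
# Sketch of lift-chart construction from (confidence, is_positive) pairs.
def cumulative_positives(scored):
    ranked = sorted(scored, key=lambda t: t[0], reverse=True)
    out, cum = [], 0
    for _, positive in ranked:
        cum += int(positive)
        out.append(cum)
    return out

scored = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.3, False)]
curve = cumulative_positives(scored)
total_pos = sum(p for _, p in scored)
baseline = [total_pos * (i + 1) / len(scored) for i in range(len(scored))]
print(curve)      # [1, 2, 2, 3, 3]
print(baseline)   # [0.6, 1.2, 1.8, 2.4, 3.0]
```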