Data Mining and Knowledge Discovery: Practice Notes


Data Mining and Knowledge Discovery: Practice Notes
Dr. Petra Kralj Novak (Petra.Kralj.Novak@ijs.si), 7.11.2017

Course
- Prof. Bojan Cestnik: Data preparation
- Prof. Nada Lavrač: Data mining overview, Advanced topics
- Dr. Petra Kralj Novak: Data mining basics, Hands-on Weka, Written exam
- Prof. Dunja Mladenić: Text mining
Reading clubs:
- Basic: Max Bramer, Principles of Data Mining (2007)
- Advanced: Charu C. Aggarwal, Data Mining: The Textbook

Keywords
Data: attribute, example, attribute-value data, target variable, class, discretization
Algorithms: decision tree induction, entropy, information gain, overfitting, Occam's razor, model pruning, naïve Bayes classifier, KNN, association rules, support, confidence, numeric prediction, regression tree, model tree, heuristics vs. exhaustive search, predictive vs. descriptive DM
Evaluation: train set, test set, accuracy, confusion matrix, cross validation, true positives, false positives, ROC space, AUC, error, precision, recall

Decision tree induction
Given: attribute-value data with a nominal target variable.
Induce: a decision tree and estimate its performance.

Attribute-value data
The attributes (nominal) are Age, Prescription, Astigmatic and Tear_Rate; the target variable is Lenses. Each row is an example, and the classes are the values of the (nominal) target variable.

Person  Age             Prescription   Astigmatic  Tear_Rate  Lenses
P1      young           myope          no          normal     YES
P2      young           myope          no          reduced    NO
P3      young           hypermetrope   no          normal     YES
P4      young           hypermetrope   no          reduced    NO
P5      young           myope          yes         normal     YES
P6      young           myope          yes         reduced    NO
P7      young           hypermetrope   yes         normal     YES
P8      young           hypermetrope   yes         reduced    NO
P9      pre-presbyopic  myope          no          normal     YES
P10     pre-presbyopic  myope          no          reduced    NO
P11     pre-presbyopic  hypermetrope   no          normal     YES
P12     pre-presbyopic  hypermetrope   no          reduced    NO
P13     pre-presbyopic  myope          yes         normal     YES
P14     pre-presbyopic  myope          yes         reduced    NO
P15     pre-presbyopic  hypermetrope   yes         normal     NO
P16     pre-presbyopic  hypermetrope   yes         reduced    NO
P17     presbyopic      myope          no          normal     NO
P18     presbyopic      myope          no          reduced    NO
P19     presbyopic      hypermetrope   no          normal     YES
P20     presbyopic      hypermetrope   no          reduced    NO
P21     presbyopic      myope          yes         normal     YES
P22     presbyopic      myope          yes         reduced    NO
P23     presbyopic      hypermetrope   yes         normal     NO
P24     presbyopic      hypermetrope   yes         reduced    NO

Training and test set
Take the attribute-value data above (the same 24 examples) and put 30% of the examples in a separate test set.

Test set

Person  Age             Prescription   Astigmatic  Tear_Rate  Lenses
P3      young           hypermetrope   no          normal     YES
P9      pre-presbyopic  myope          no          normal     YES
P12     pre-presbyopic  hypermetrope   no          reduced    NO
P13     pre-presbyopic  myope          yes         normal     YES
P15     pre-presbyopic  hypermetrope   yes         normal     NO
P16     pre-presbyopic  hypermetrope   yes         reduced    NO
P23     presbyopic      hypermetrope   yes         normal     NO

Put these data away and do not look at them during the training phase!

Training set

Person  Age             Prescription   Astigmatic  Tear_Rate  Lenses
P1      young           myope          no          normal     YES
P2      young           myope          no          reduced    NO
P4      young           hypermetrope   no          reduced    NO
P5      young           myope          yes         normal     YES
P6      young           myope          yes         reduced    NO
P7      young           hypermetrope   yes         normal     YES
P8      young           hypermetrope   yes         reduced    NO
P10     pre-presbyopic  myope          no          reduced    NO
P11     pre-presbyopic  hypermetrope   no          normal     YES
P14     pre-presbyopic  myope          yes         reduced    NO
P17     presbyopic      myope          no          normal     NO
P18     presbyopic      myope          no          reduced    NO
P19     presbyopic      hypermetrope   no          normal     YES
P20     presbyopic      hypermetrope   no          reduced    NO
P21     presbyopic      myope          yes         normal     YES
P22     presbyopic      myope          yes         reduced    NO
P24     presbyopic      hypermetrope   yes         reduced    NO
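The split on the slides was made by hand. Below is a minimal Python sketch of a random 70/30 holdout split, assuming the examples are stored as dictionaries; the helper name holdout_split, the seed and the abbreviated data list are illustrative, not part of the lecture material.

```python
import random

# A few of the 24 examples from the attribute-value table above;
# in practice all rows would be listed (or read from a file).
data = [
    {"Person": "P1", "Age": "young", "Prescription": "myope",
     "Astigmatic": "no", "Tear_Rate": "normal", "Lenses": "YES"},
    {"Person": "P2", "Age": "young", "Prescription": "myope",
     "Astigmatic": "no", "Tear_Rate": "reduced", "Lenses": "NO"},
    {"Person": "P3", "Age": "young", "Prescription": "hypermetrope",
     "Astigmatic": "no", "Tear_Rate": "normal", "Lenses": "YES"},
    {"Person": "P4", "Age": "young", "Prescription": "hypermetrope",
     "Astigmatic": "no", "Tear_Rate": "reduced", "Lenses": "NO"},
]

def holdout_split(examples, test_fraction=0.3, seed=0):
    """Shuffle the examples and put ~30% of them in a separate test set."""
    shuffled = examples[:]                 # keep the caller's list untouched
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training set, test set)

train_set, test_set = holdout_split(data)
print(len(train_set), "training examples,", len(test_set), "test examples")
```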

Decision tree induction
Given: attribute-value data with a nominal target variable.
Induce: a decision tree and estimate its performance.

Decision tree induction (ID3)
Given: attribute-value data with a nominal target variable.
Divide the data into a training set (S) and a test set (T).
Induce a decision tree on the training set S:
1. Compute the entropy E(S) of the set S.
2. IF E(S) = 0:
3.   The current set is clean and therefore a leaf in our tree.
4. IF E(S) > 0:
5.   Compute the information gain Gain(S, A) of each attribute A.
6.   The attribute A with the highest information gain becomes the root.
7.   Divide the set S into subsets S_i according to the values of A.
8.   Repeat steps 1-7 on each S_i.
Test the model on the test set T.

Quinlan, J. R. (1986). Induction of Decision Trees. Machine Learning 1(1), 81-106.

Information gain

Gain(S, A) = E(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} E(S_v)

where |S_v| is the number of examples in the subset S_v (the examples of S with attribute A equal to v), |S| is the number of examples in the set S, and |S_v| / |S| is the probability of the branch.

Entropy

E(S) = - \sum_{c=1}^{N} p_c \log_2 p_c

Calculate the following entropies:
E(0, 1) =
E(1/2, 1/2) =
E(1/4, 3/4) =
E(1/7, 6/7) =
E(6/7, 1/7) =
E(0.1, 0.9) =
E(0.001, 0.999) =

Entropy

E(S) = - \sum_{c=1}^{N} p_c \log_2 p_c

Calculate the following entropies:
E(0, 1) = 0
E(1/2, 1/2) = 1
E(1/4, 3/4) = 0.81
E(1/7, 6/7) = 0.59
E(6/7, 1/7) = 0.59
E(0.1, 0.9) = 0.47
E(0.001, 0.999) = 0.01
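The values above can be checked with a few lines of Python; this is a minimal sketch, using the usual convention that terms with p_c = 0 contribute nothing.

```python
from math import log2

def entropy(probabilities):
    """E(S) = -sum_c p_c * log2(p_c); terms with p_c = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

# Reproduce the values on the slide (rounded to two decimals):
for dist in [(0, 1), (1/2, 1/2), (1/4, 3/4), (1/7, 6/7),
             (6/7, 1/7), (0.1, 0.9), (0.001, 0.999)]:
    print(dist, "->", round(entropy(dist), 2))
```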

Entropy

E(S) = - \sum_{c=1}^{N} p_c \log_2 p_c

[Figure: the entropy of a two-class distribution, E(p, 1-p), plotted against p; it is 0 for p = 0 or p = 1 and reaches its maximum of 1 at p = 0.5. The calculated values E(0, 1) = 0, E(1/2, 1/2) = 1, E(1/4, 3/4) = 0.81, E(1/7, 6/7) = 0.59, E(6/7, 1/7) = 0.59, E(0.1, 0.9) = 0.47 and E(0.001, 0.999) = 0.01 lie on this curve.]

Entropy and information gain

For a two-class problem with p_2 = 1 - p_1, the entropy is E(p_1, p_2) = -p_1 \log_2 p_1 - p_2 \log_2 p_2:

p_1    p_2    E(p_1, p_2)
0.00   1.00   0.00
0.05   0.95   0.29
0.10   0.90   0.47
0.15   0.85   0.61
0.20   0.80   0.72
0.25   0.75   0.81
0.30   0.70   0.88
0.35   0.65   0.93
0.40   0.60   0.97
0.45   0.55   0.99
0.50   0.50   1.00
0.55   0.45   0.99
0.60   0.40   0.97
0.65   0.35   0.93
0.70   0.30   0.88
0.75   0.25   0.81
0.80   0.20   0.72
0.85   0.15   0.61
0.90   0.10   0.47
0.95   0.05   0.29
1.00   0.00   0.00

Gain(S, A) = E(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} E(S_v)

where |S_v| is the number of examples in the subset S_v, |S| is the number of examples in the set S, and |S_v| / |S| is the probability of the branch.

[Figure: the entropy values above plotted as a curve over the distribution of class probabilities.]

Decision tree induction (ID3)
Given: attribute-value data with a nominal target variable.
Divide the data into a training set (S) and a test set (T).
Induce a decision tree on the training set S:
1. Compute the entropy E(S) of the set S.
2. IF E(S) = 0:
3.   The current set is clean and therefore a leaf in our tree.
4. IF E(S) > 0:
5.   Compute the information gain Gain(S, A) of each attribute A.
6.   The attribute A with the highest information gain becomes the root.
7.   Divide the set S into subsets S_i according to the values of A.
8.   Repeat steps 1-7 on each S_i.
Test the model on the test set T.
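To make the steps above concrete, here is a minimal recursive ID3 sketch in Python for attribute-value data stored as dictionaries (as in the split example earlier). The names id3 and information_gain are illustrative; a production implementation such as J48 in Weka also handles pruning, missing values and numeric attributes.

```python
from collections import Counter
from math import log2

def entropy(examples, target):
    """Entropy of the class distribution of a list of examples (dicts)."""
    counts = Counter(ex[target] for ex in examples)
    n = len(examples)
    return -sum(c / n * log2(c / n) for c in counts.values())

def information_gain(examples, attribute, target):
    """Gain(S, A) = E(S) - sum_v |S_v|/|S| * E(S_v)."""
    n = len(examples)
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        remainder += len(subset) / n * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    """Return a nested dict {attribute: {value: subtree}} or a class label."""
    if entropy(examples, target) == 0:     # the set is clean: a leaf
        return examples[0][target]
    if not attributes:                     # out of attributes: majority leaf
        return Counter(ex[target] for ex in examples).most_common(1)[0][0]
    # The attribute with the highest information gain becomes the root.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    return {best: {value: id3([ex for ex in examples if ex[best] == value],
                              remaining, target)
                   for value in {ex[best] for ex in examples}}}

# e.g. on the Lenses training set (Person is an identifier, so it is excluded):
# tree = id3(train_set, ["Age", "Prescription", "Astigmatic", "Tear_Rate"], "Lenses")
```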

Decision tree
[Figure: the decision tree induced from the training set.]

Confusion matrix

                    Predicted positive   Predicted negative
Actual positive     TP                   FN
Actual negative     FP                   TN

A confusion matrix is a matrix showing actual and predicted classifications. Classification measures can be calculated from it, such as
classification accuracy = #(correctly classified examples) / #(all examples) = (TP + TN) / (TP + TN + FP + FN)

Evaluating decision tree accuracy
Classifying the 7 examples of the test set above (P3, P9, P12, P13, P15, P16, P23) with the induced tree gives:

                    Actual positive   Actual negative
Predicted positive  TP = 3            FP = 2
Predicted negative  FN = 0            TN = 2

Ca = (3 + 2) / (3 + 2 + 2 + 0) = 71%
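The accuracy can be computed directly from the confusion matrix; a tiny Python sketch:

```python
def classification_accuracy(tp, fn, fp, tn):
    """(TP + TN) / (TP + TN + FP + FN): the share of correctly classified examples."""
    return (tp + tn) / (tp + tn + fp + fn)

print(classification_accuracy(tp=3, fn=0, fp=2, tn=2))  # 0.714..., i.e. about 71%
```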

Is 71% good classification accuracy?
It depends on the dataset! Compare to the majority class classifier (ZeroR in Weka), which classifies all the data into the most represented class. In our Lenses example, the majority class is Lenses = NO:
- accuracy on the training set = 11/17 = 65%
- accuracy on the test set = 4/7 = 57%
(If we had bigger sets, these two numbers would be almost the same.)
Since 71% > 57%, the tree gives some improvement over the majority class classifier.
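A minimal sketch of the majority class baseline (the idea behind ZeroR in Weka), reusing the dictionary representation of the examples; the function name is illustrative.

```python
from collections import Counter

def majority_class_accuracy(train_examples, test_examples, target="Lenses"):
    """Predict the most frequent training class for every test example
    and return that class together with the accuracy on the test set."""
    majority = Counter(ex[target] for ex in train_examples).most_common(1)[0][0]
    correct = sum(1 for ex in test_examples if ex[target] == majority)
    return majority, correct / len(test_examples)

# With the Lenses training/test split above: majority class NO, test accuracy 4/7 = 57%.
```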

Discussion
- How much is the information gain for the attribute Person? How would it perform on the test set?
- How do we compute entropy for a target variable that has three values? Lenses = {hard=4, soft=5, none=13}
- What would be the classification accuracy of our decision tree if we pruned it at the node Astigmatic?
- What are the stopping criteria for building a decision tree?
- How would you compute the information gain for a numeric attribute?

Discussion about decision trees: how much is the information gain for the attribute Person, and how would it perform on the test set?

Information gain of the attribute Person
On the training set:
- The attribute has as many values as there are examples.
- Each leaf therefore holds exactly one example: E(1/1, 0/1) = 0 (the entropy of each leaf is zero).
- The weighted sum of entropies is zero, so the information gain is maximal (equal to the entropy of the entire training set).
On the test set:
- The Person values from the test set do not appear in the tree, so the tree cannot classify them.

Discussion about decision trees: how do we compute entropy for a target variable that has three values? Lenses = {hard=4, soft=5, none=13}

Entropy{hard=4, soft=5, none=13} = E(4/22, 5/22, 13/22)
= - \sum_i p_i \log_2 p_i
= - (4/22) \log_2 (4/22) - (5/22) \log_2 (5/22) - (13/22) \log_2 (13/22)
= 1.38
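The same few lines of Python verify the three-class entropy (a sketch with the counts hard-coded):

```python
from math import log2

counts = {"hard": 4, "soft": 5, "none": 13}
n = sum(counts.values())                      # 22 examples in total
e = -sum(c / n * log2(c / n) for c in counts.values())
print(round(e, 2))                            # 1.38
```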

Discussion about decision trees: what would be the classification accuracy of our decision tree if we pruned it at the node Astigmatic?

Decision tree pruning
[Figure: the decision tree with the subtree rooted at the node Astigmatic replaced by the leaf Lenses = YES.]

These two trees are equivalent.
[Figure: the two equivalent trees.]

Classification accuracy of the pruned tree

                    Actual positive   Actual negative
Predicted positive  TP = 3            FP = 2
Predicted negative  FN = 0            TN = 2

Ca = (3 + 2) / (3 + 2 + 2 + 0) = 71%

Discussion about decision trees: what are the stopping criteria for building a decision tree?

Stopping criteria for building a decision tree
ID3:
- pure nodes (entropy = 0)
- out of attributes
J48 (C4.5):
- minimum number of instances in a leaf constraint

Discussion about decision trees: how would you compute the information gain for a numeric attribute?

Information gain of a numeric attribute
Sort the examples by the numeric attribute (here Age) and define the possible splitting points between consecutive values. In the lecture example the candidate splitting points are 30.5, 41.5, 45.5, 50.5, 52.5 and 66. [The example data table itself is shown as a figure on the slides.]

Information gain of a numeric attribute: the split at Age = 30.5
E(S) = E(11/24, 13/24) = 0.99
Age < 30.5: 6/24 of the examples, E(6/6, 0/6) = 0
Age >= 30.5: 18/24 of the examples, E(5/18, 13/18) = 0.85
InfoGain(S, Age_30.5) = E(S) - \sum_v p_v E(S_v) = 0.99 - (6/24 * 0 + 18/24 * 0.85) = 0.35

Information gain of a numeric attribute
InfoGain(S, Age_30.5) = 0.35. Repeat the same computation for the remaining candidate splits Age < 41.5, Age < 45.5, Age < 50.5, Age < 52.5 and Age < 66, and keep the split with the highest information gain; a sketch of the procedure follows below.
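As announced above, a minimal Python sketch of the procedure: sort by the numeric attribute, take the midpoints between consecutive distinct values as candidate splitting points, and compute the information gain of each binary split. The ages and classes in the example call are illustrative, not the lecture's exact data.

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

def numeric_split_gains(values, labels):
    """Information gain of every binary split 'value < t' on a numeric attribute."""
    pairs = sorted(zip(values, labels))
    base = entropy([c for _, c in pairs])
    distinct = sorted(set(values))
    gains = {}
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2                      # candidate splitting point
        left = [c for v, c in pairs if v < t]
        right = [c for v, c in pairs if v >= t]
        gains[t] = base - (len(left) / len(pairs) * entropy(left)
                           + len(right) / len(pairs) * entropy(right))
    return gains

# Illustrative data: 10 examples with a numeric attribute Age and a binary class.
ages = [23, 28, 33, 40, 43, 48, 53, 60, 64, 70]
classes = ["YES", "YES", "YES", "NO", "YES", "NO", "NO", "YES", "NO", "NO"]
for t, g in sorted(numeric_split_gains(ages, classes).items()):
    print(f"Age < {t}: gain = {g:.2f}")
```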

Decision trees
- There are many possible decision trees; their number grows very rapidly with k, the number of binary attributes.
- Decision tree induction therefore performs a heuristic search guided by information gain.
- Information gain is short-sighted.

Trees are shortsighted (1)

A  B  C  A xor B
1  1  0  0
0  0  1  0
1  0  0  1
0  0  0  0
0  1  0  1
1  1  1  0
1  0  1  1
0  0  1  0
0  1  0  1
0  1  0  1
1  0  1  1
1  1  1  0

Three attributes: A, B and C. The target variable is a logical combination of the attributes A and B: class = A xor B. Attribute C is random w.r.t. the target variable.

Trees are shortsighted (2)
[Figure: one-level trees splitting on attribute A alone, attribute B alone and attribute C alone.]
Attribute C has the highest information gain!
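The claim can be checked on the table two slides back; a short Python sketch computing the information gain of each attribute:

```python
from math import log2

# The 12 examples from the table: (A, B, C, A xor B)
rows = [(1, 1, 0, 0), (0, 0, 1, 0), (1, 0, 0, 1), (0, 0, 0, 0),
        (0, 1, 0, 1), (1, 1, 1, 0), (1, 0, 1, 1), (0, 0, 1, 0),
        (0, 1, 0, 1), (0, 1, 0, 1), (1, 0, 1, 1), (1, 1, 1, 0)]

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

def gain(column):
    """Information gain of splitting on one attribute column (0=A, 1=B, 2=C)."""
    labels = [r[3] for r in rows]
    g = entropy(labels)
    for value in (0, 1):
        subset = [r[3] for r in rows if r[column] == value]
        g -= len(subset) / len(rows) * entropy(subset)
    return g

for name, column in [("A", 0), ("B", 1), ("C", 2)]:
    print(f"Gain({name}) = {gain(column):.3f}")
# Gain(A) and Gain(B) are 0, while Gain(C) is about 0.08, so greedy
# induction puts the irrelevant attribute C at the root.
```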

Trees are shortsighted (3)
[Figure: the decision tree built by ID3 compared with the real model behind the data.]

Overcoming the shortsightedness of decision trees
Random forests (Breiman & Cutler, 2001):
- A random forest is a set of decision trees.
- Each tree is induced from a bootstrap sample of the examples.
- For each node of the tree, the split is selected among a random subset of the attributes.
- All the trees vote for the classification.
- See also ensemble learning.
ReliefF for attribute estimation (Kononenko et al., 1997).
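A minimal sketch of the random-forest ingredients using scikit-learn (an assumption on my part: scikit-learn is not part of the lecture, which uses Weka). The bootstrap sample and the random attribute subset per split correspond to the bootstrap and max_features parameters; the trees vote inside predict/score.

```python
from sklearn.ensemble import RandomForestClassifier

# The XOR data from the previous slides: attributes A, B, C; class = A xor B.
X = [[1, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 0], [0, 1, 0], [1, 1, 1],
     [1, 0, 1], [0, 0, 1], [0, 1, 0], [0, 1, 0], [1, 0, 1], [1, 1, 1]]
y = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0]

# Each tree is grown on a bootstrap sample of the examples and considers
# only a random subset of the attributes at every split; the trees vote.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.score(X, y))   # usually 1.0 on this toy training set
```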