Learning Classifiers


Performance

Instances x_i in a dataset D are mapped to a feature space. [Figure: instances plotted in feature space with a decision boundary separating the classes; workflow: Initial Model (Assumptions) → Training Process (on Training Data) → Trained Model → Testing Process (on Test Data) → Tested Model → Performance.]

Classes are associated with the instances. Classification: f(x_i) = c, where c is a class label, the x_{i,j} are the (discrete) feature values, and f is the classifier; the dataset D is a multiset. Objective: learn f (supervised learning).

Performance measures are based on the counts
  TP: True Positives
  TN: True Negatives
  FP: False Positives
  FN: False Negatives
Note: if N = |D|, then N = TP + TN + FP + FN.

Confusion matrix:
                    Predicted positive   Predicted negative
  Actual positive   true positive        false negative
  Actual negative   false positive       true negative

Performance measures:
  Success rate σ:   σ = (TP + TN) / N
  Error rate ε:     ε = 1 − σ
  True Positive Rate (TPR, = recall ρ):   TPR = TP / (TP + FN)
  False Negative Rate (FNR):              FNR = 1 − TPR
  False Positive Rate (FPR):              FPR = FP / (FP + TN)
  True Negative Rate (TNR):               TNR = 1 − FPR
  Precision π:   π = TP / (TP + FP)
  F-measure:     F = 2ρπ / (ρ + π) = 2TP / (2TP + FP + FN)

Example: choosing contact lenses

  Age              Prescription    Astigmatism   Tear production rate   Lens
  young            myope           no            reduced                none
  young            myope           no            normal                 soft
  young            myope           yes           reduced                none
  young            myope           yes           normal                 hard
  young            hypermetrope    no            reduced                none
  young            hypermetrope    no            normal                 soft
  young            hypermetrope    yes           reduced                none
  young            hypermetrope    yes           normal                 hard
  pre-presbyopic   myope           no            reduced                none
  pre-presbyopic   myope           no            normal                 soft
  pre-presbyopic   myope           yes           reduced                none
  pre-presbyopic   myope           yes           normal                 hard
  pre-presbyopic   hypermetrope    no            reduced                none
  pre-presbyopic   hypermetrope    no            normal                 soft
  pre-presbyopic   hypermetrope    yes           reduced                none
  pre-presbyopic   hypermetrope    yes           normal                 none
  presbyopic       myope           no            reduced                none
  presbyopic       myope           no            normal                 none
  presbyopic       myope           yes           reduced                none
  presbyopic       myope           yes           normal                 hard
  presbyopic       hypermetrope    no            reduced                none
  presbyopic       hypermetrope    no            normal                 soft
  presbyopic       hypermetrope    yes           reduced                none
  presbyopic       hypermetrope    yes           normal                 none
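The measures above follow directly from the four counts. A minimal sketch in Python (the function name and the worked example are my own illustration, not from the slides):

from __future__ import annotations

def performance(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the performance measures listed above from confusion-matrix counts."""
    n = tp + tn + fp + fn                  # N = |D|
    success = (tp + tn) / n                # success rate sigma
    tpr = tp / (tp + fn)                   # recall rho
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)             # pi
    f_measure = 2 * tp / (2 * tp + fp + fn)
    return {"sigma": success, "epsilon": 1 - success, "TPR": tpr, "FNR": 1 - tpr,
            "FPR": fpr, "TNR": 1 - fpr, "pi": precision, "F": f_measure}

# Example: a classifier that predicts "none" for every instance of the table above,
# with "none" taken as the positive class: TP = 15, FP = 9, TN = FN = 0.
print(performance(tp=15, tn=0, fp=9, fn=0))   # sigma = 0.625, pi = 0.625, F = 0.769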

Rule representation and reasoning

Rule representation: logical implications (= rules)
  LHS → RHS
(LHS = left-hand side = antecedent; RHS = right-hand side = consequent). Literals in LHS and RHS are of the form
  Variable ⊙ value   (or: Attribute ⊙ value),   with ⊙ ∈ {<, ≤, =, >, ≥}.

Rule-based reasoning:
  R ∪ F ⊢ C
where R is a set of rules r ∈ R (the rule base), F is a set of facts of the form Variable = value, and C is a set of conclusions of the same form as the facts.

Basic ideas: ZeroR

ZeroR constructs the rule that predicts the majority class; it is used as a baseline performance.

Example contact-lenses recommendation rule:
  → Lens = none
Total number of instances: 24; correctly classified: 15 (62.5%); incorrectly classified: 9 (37.5%).

=== Detailed Accuracy By Class ===
  TP Rate   FP Rate   Precision   Recall   F-Measure   Class
  0         0         0           0        0           soft
  0         0         0           0        0           hard
  1         1         0.625       1        0.769       none

=== Confusion Matrix ===
   a  b  c   <-- classified as
   0  0  5 | a = soft
   0  0  4 | b = hard
   0  0 15 | c = none

OneR

OneR constructs a single-condition rule for each variable-value pair, then selects the set of rules defined on the single variable (in the condition) that performs best.

  OneR(classvar) {
    R ← ∅
    for each var ∈ Vars do
      for each value ∈ Domain(var) do
        most-freq-value ← MostFreq(var = value, classvar)
        rule ← MakeRule(var = value, classvar = most-freq-value)
        R ← R ∪ {rule}
    for each r ∈ R do CalculateErrorRate(r)
    R ← SelectBestRulesForSingleVar(R)
  }

OneR: Example

Rules for contact-lenses recommendation:
  Tears = reduced → Lens = none
  Tears = normal  → Lens = soft
17/24 instances correct. Correctly classified instances: 17 (70.83%); incorrectly classified instances: 7 (29.17%).

=== Detailed Accuracy By Class ===
  TP Rate   FP Rate   Precision   Recall   F-Measure   Class
  1         0.368     0.417       1        0.588       soft
  0         0         0           0        0           hard
  0.8       0         1           0.8      0.889       none

=== Confusion Matrix ===
   a  b  c   <-- classified as
   5  0  0 | a = soft
   4  0  0 | b = hard
   3  0 12 | c = none
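A compact sketch of the OneR idea in Python (the data representation and function names are my own, not the slides' code): for each attribute it builds one rule per value predicting that value's majority class, and keeps the attribute whose rule set makes the fewest errors.

from collections import Counter, defaultdict

def one_r(instances, attributes, class_attr):
    """instances: list of dicts mapping attribute names to values."""
    best_attr, best_rules, best_errors = None, None, None
    for attr in attributes:
        # Count class values per attribute value.
        by_value = defaultdict(Counter)
        for inst in instances:
            by_value[inst[attr]][inst[class_attr]] += 1
        # One rule per value: predict the majority class for that value.
        rules = {val: counts.most_common(1)[0][0] for val, counts in by_value.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        if best_errors is None or errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules, errors
    return best_attr, best_rules, best_errors

On the 24-instance contact-lens table this selects the Tears attribute (7 errors), reproducing the two rules shown above with 17/24 instances correct.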

Generalisation: separate-and-cover

Covering of classes: a rule set is generated for each class value separately. Peeling (box compression): instances are peeled off (fall outside the box) one face at a time. PRISM is an algorithm of this kind.

Example: choosing contact lenses. Recommended contact lenses: none, soft, hard. General principles:
1. Choose a class value, e.g. hard.
2. Construct the rule: Condition → Lens = hard.
3. Determine the accuracy α = p/t for all possible conditions, where t is the total number of instances covered by the rule and p is the number of covered instances with the right (positive) class value:

  Condition                      α = p/t
  Age = young                    2/8
  Age = pre-presbyopic           1/8
  Age = presbyopic               1/8
  Prescription = myope           3/12
  Prescription = hypermetrope    1/12
  Astigmatism = no               0/12
  Astigmatism = yes              4/12
  Tears = reduced                0/12
  Tears = normal                 4/12

4. Select the best condition: Astigmatism = yes (4/12).

Separate-and-cover algorithm

  SC(classvar, D) {
    R ← ∅
    for each val ∈ Domain(classvar) do
      E ← D
      while E contains instances with val do
        rule ← MakeRule(rhs(classvar = val), lhs(∅))
        until rule is perfect do
          IR ← ∅
          for each var ∈ Vars such that var does not occur in rule do
            for each value ∈ Domain(var) do
              inter-rule ← Add(rule, lhs(var = value))
              IR ← IR ∪ {inter-rule}
          rule ← SelectRule(IR)
        R ← R ∪ {rule}
        RC ← InstancesCoveredBy(rule, E)
        E ← E \ RC
  }

SelectRule is based on the accuracy α = p/t; if α = α' for two rules, select the one with the highest p.

SelectRule example

Rule so far: RHS: Lens = hard; LHS: Astigmatism = yes, with α = 4/12. Not very accurate; expanded rule:
  (Astigmatism = yes ∧ New-condition) → Lens = hard

Instances with Astigmatism = yes:

  Age              Prescription    Astigmatism   Tear production rate   Lens
  young            myope           yes           reduced                none
  young            myope           yes           normal                 hard
  young            hypermetrope    yes           reduced                none
  young            hypermetrope    yes           normal                 hard
  pre-presbyopic   myope           yes           reduced                none
  pre-presbyopic   myope           yes           normal                 hard
  pre-presbyopic   hypermetrope    yes           reduced                none
  pre-presbyopic   hypermetrope    yes           normal                 none
  presbyopic       myope           yes           reduced                none
  presbyopic       myope           yes           normal                 hard
  presbyopic       hypermetrope    yes           reduced                none
  presbyopic       hypermetrope    yes           normal                 none

Candidate conditions: Age = young (2/4); Age = pre-presbyopic (1/4); Age = presbyopic (1/4); Prescription = myope (3/6); Prescription = hypermetrope (1/6); Tears = reduced (0/6); Tears = normal (4/6).
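A sketch of this greedy rule-growing step in Python (the function names covered_accuracy and grow_rule, and the tie-breaking details, are my own illustration of the separate-and-cover idea under the assumptions stated on the slide):

def covered_accuracy(instances, conditions, class_attr, target):
    """Return (p, t) for a conjunctive LHS given as a dict {attribute: value}."""
    covered = [x for x in instances if all(x[a] == v for a, v in conditions.items())]
    t = len(covered)
    p = sum(1 for x in covered if x[class_attr] == target)
    return p, t

def grow_rule(instances, attributes, class_attr, target):
    """Greedily add the condition with the highest accuracy p/t until the rule is perfect."""
    conditions = {}
    while True:
        p, t = covered_accuracy(instances, conditions, class_attr, target)
        if t == 0 or p == t:                 # perfect (or empty) coverage: stop
            return conditions
        candidates = []
        for attr in attributes:
            if attr in conditions:
                continue
            for value in {x[attr] for x in instances}:
                trial = dict(conditions, **{attr: value})
                p2, t2 = covered_accuracy(instances, trial, class_attr, target)
                if t2 > 0:
                    candidates.append((p2 / t2, p2, trial))
        if not candidates:                   # no refinement possible
            return conditions
        # Best accuracy first; ties broken by higher coverage p, as on the slide.
        candidates.sort(key=lambda c: (c[0], c[1]), reverse=True)
        conditions = candidates[0][2]

On the 24-instance table with target "hard" this grows a rule like Astigmatism = yes ∧ Tears = normal ∧ Prescription = myope, matching the worked example (ties between equally accurate conditions may be broken differently).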

SelectRule example (continued)

Rule: (Astigmatism = yes ∧ Tears = normal) → Lens = hard. Expanded rule:
  (Astigmatism = yes ∧ Tears = normal ∧ New-condition) → Lens = hard

Instances with Astigmatism = yes and Tears = normal:

  Age              Prescription    Astigmatism   Tear production rate   Lens
  young            myope           yes           normal                 hard
  young            hypermetrope    yes           normal                 hard
  pre-presbyopic   myope           yes           normal                 hard
  pre-presbyopic   hypermetrope    yes           normal                 none
  presbyopic       myope           yes           normal                 hard
  presbyopic       hypermetrope    yes           normal                 none

Candidate conditions: Age = young (2/2); Age = pre-presbyopic (1/2); Age = presbyopic (1/2); Prescription = myope (3/3); Prescription = hypermetrope (1/3). Selected rule:
  (Astigmatism = yes ∧ Tears = normal ∧ Prescription = myope) → Lens = hard

SC example (continued)

Delete the 3 instances covered by this rule from E; grow a new rule New-condition → Lens = hard on the remaining instances:

  Age              Prescription    Astigmatism   Tear production rate   Lens
  young            myope           no            reduced                none
  young            myope           no            normal                 soft
  young            myope           yes           reduced                none
  young            hypermetrope    no            reduced                none
  young            hypermetrope    no            normal                 soft
  young            hypermetrope    yes           reduced                none
  young            hypermetrope    yes           normal                 hard
  pre-presbyopic   myope           no            reduced                none
  pre-presbyopic   myope           no            normal                 soft
  pre-presbyopic   myope           yes           reduced                none
  pre-presbyopic   hypermetrope    no            reduced                none
  pre-presbyopic   hypermetrope    no            normal                 soft
  pre-presbyopic   hypermetrope    yes           reduced                none
  pre-presbyopic   hypermetrope    yes           normal                 none
  presbyopic       myope           no            reduced                none
  presbyopic       myope           no            normal                 none
  presbyopic       myope           yes           reduced                none
  presbyopic       hypermetrope    no            reduced                none
  presbyopic       hypermetrope    no            normal                 soft
  presbyopic       hypermetrope    yes           reduced                none
  presbyopic       hypermetrope    yes           normal                 none

Rules: WEKA results

If Astigmatism = no and Tears = normal and Prescription = hypermetrope then Lens = soft
If Astigmatism = no and Tears = normal and Age = young then Lens = soft
If Age = pre-presbyopic and Astigmatism = no and Tears = normal then Lens = soft
If Astigmatism = yes and Tears = normal and Prescription = myope then Lens = hard
If Age = young and Astigmatism = yes and Tears = normal then Lens = hard
If Tears = reduced then Lens = none
If Age = presbyopic and Tears = normal and Prescription = myope and Astigmatism = no then Lens = none
If Prescription = hypermetrope and Astigmatism = yes and Age = pre-presbyopic then Lens = none
If Age = presbyopic and Prescription = hypermetrope and Astigmatism = yes then Lens = none

=== Confusion Matrix ===
   a  b  c   <-- classified as
   5  0  0 | a = soft
   0  4  0 | b = hard
   0  0 15 | c = none

Correctly classified instances: 24 (100%)

Limitations

Adding one condition at a time is greedy search (the optimal rule may be missed).
The accuracy α = p/t promotes overfitting: the closer p is to t, the higher α, regardless of how few instances the rule covers. The resulting rules cover all instances perfectly.

Example: consider rule r1 with accuracy α1 = 1/1 and rule r2 with accuracy α2 = 19/20; then r1 is considered superior to r2.

Alternative 1: information gain. Alternative 2: a probabilistic measure.
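One simple way to apply such a learned rule set is as an ordered list in which the first matching rule fires; a small Python sketch (the representation, the default class, and the example instance are my own assumptions, not part of the WEKA output):

def classify(instance, rules, default="none"):
    """Apply an ordered rule list: the first rule whose conditions all match fires."""
    for conditions, label in rules:
        if all(instance.get(a) == v for a, v in conditions.items()):
            return label
    return default

# Two of the rules above, as (conditions, class) pairs.
rules = [({"Tears": "reduced"}, "none"),
         ({"Astigmatism": "yes", "Tears": "normal", "Prescription": "myope"}, "hard")]
print(classify({"Age": "young", "Prescription": "myope",
                "Astigmatism": "yes", "Tears": "normal"}, rules))   # prints "hard"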

Information gain

  I_D(r') = p' · [ log(p'/t') − log(p/t) ]

where α = p/t is the accuracy before adding a condition to rule r, and α' = p'/t' is the accuracy after the condition has been added, yielding r'.

Example: consider rule r' with accuracy α' = 1/1 and rule r'' with accuracy α'' = 19/20, both modifications of a rule r with α = 20/200. Then r' is considered superior to r'' according to accuracy, but (with logarithms to base 10)

  I_D(r')  = 1 · [log(1/1) − log(20/200)]    = 1
  I_D(r'') = 19 · [log(19/20) − log(20/200)] ≈ 18.6

hence r' is inferior to r'' according to information gain.

Comparison: accuracy versus information gain

Information gain I:
  - emphasis is on a large number of positive instances;
  - high-coverage cases first, special cases later;
  - the resulting rules cover all instances perfectly.
Accuracy α:
  - takes the number of positive instances into account only to break ties;
  - special cases first, high-coverage cases later;
  - the resulting rules cover all instances perfectly.

Probabilistic measure

Let N = |D| be the number of instances in dataset D, M the number of instances in D belonging to the class of interest, t the number of instances on which rule r succeeds, and p the number of those that are positive (belong to the class). The hypergeometric distribution

  f(k) = C(M, k) · C(N − M, t − k) / C(N, t)

(sampling without replacement; C(n, k) denotes the binomial coefficient) gives the probability that k of the t selected instances belong to the class. Rule r selects t instances, of which p are positive. The probability that a randomly chosen rule r' does as well as or better than r is

  P(r') = Σ_{k=p}^{min{t,M}} f(k) = Σ_{k=p}^{min{t,M}} C(M, k) · C(N − M, t − k) / C(N, t)
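This tail probability can be computed directly with scipy's hypergeometric distribution; a quick check in Python (scipy.stats.hypergeom is a standard API, the numbers are illustrative and loosely echo rule r'' above, assuming M = 20 positives in a dataset of N = 200):

from scipy.stats import hypergeom

# Slide notation: N = |D|, M positives in D, the rule covers t instances, p of them positive.
N, M, t, p = 200, 20, 20, 19

# P(r') = sum_{k=p}^{min(t,M)} f(k); hypergeom.sf(p-1, ...) is exactly this upper tail.
# Note scipy's argument order: (k, population size, number of successes, number of draws).
prob = hypergeom.sf(p - 1, N, M, t)
print(prob)   # a tiny value: a random rule is very unlikely to do as well as r''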

Approximation

  P(r') = Σ_{k=p}^{min{t,M}} C(M, k) · C(N − M, t − k) / C(N, t)
        ≈ Σ_{k=p}^{min{t,M}} C(t, k) · (M/N)^k · (1 − M/N)^(t−k)

i.e. the hypergeometric distribution is approximated by a binomial distribution. This tail equals

  I_{M/N}(p, t − p + 1)

where I_x(α, β) is the (regularized) incomplete beta function

  I_x(α, β) = (1/B(α, β)) · ∫_0^x z^(α−1) (1 − z)^(β−1) dz

and B(α, β) is the beta function, defined as

  B(α, β) = ∫_0^1 z^(α−1) (1 − z)^(β−1) dz

Reduced-error pruning

The danger of overfitting to the training set can be reduced by splitting it into a growing set GS (2/3 of the training set) and a pruning set PS (1/3 of the training set).

  REP(classvar, D) {
    R ← ∅; E ← D
    (GS, PS) ← Split(E)
    while E ≠ ∅ do
      IR ← ∅
      for each val ∈ Domain(classvar) do
        if GS and PS contain a val-instance then
          rule ← BSC(classvar = val, GS)
          while P(rule⁻, PS) > P(rule, PS) do
            rule ← rule⁻
          IR ← IR ∪ {rule}
      rule ← SelectRule(IR); R ← R ∪ {rule}
      RC ← InstancesCoveredBy(rule, E)
      E ← E \ RC
      (GS, PS) ← Split(E)
  }

Here BSC is the basic separate-and-cover algorithm, and rule⁻ is the rule with its last condition removed.

Divide-and-conquer: decision trees

[Figure: a decision tree for Lens = ?, with internal nodes testing Tears, Astigmatism, Age and Prescription, and leaves labelled soft, hard and none.]

Learning decision trees:
  R. Quinlan: ID3, C4.5 and C5.0
  L. Breiman: CART (Classification and Regression Trees)

Which variable/attribute is best?

Splitting dataset D on a variable X partitions it into subsets D_{X=x}, one for each value x of X; C is the class variable. Entropy of the class within a subset:

  H_C(X = x) = − Σ_c P(C = c | X = x) ln P(C = c | X = x)

Expected entropy after the split:

  E_{H_C}(X) = Σ_x P(X = x) · H_C(X = x)
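The binomial-tail identity is easy to verify numerically; a small check in Python (scipy.stats.binom and scipy.special.betainc are standard APIs; the numbers are illustrative):

from scipy.stats import binom
from scipy.special import betainc

# Illustrative numbers in the slide's notation.
N, M, t, p = 200, 20, 20, 19
q = M / N

tail = binom.sf(p - 1, t, q)            # sum_{k=p}^{t} C(t,k) q^k (1-q)^(t-k)
beta_val = betainc(p, t - p + 1, q)     # regularized incomplete beta I_q(p, t-p+1)
print(tail, beta_val)                   # the two values agree up to floating-point error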

Information gain (again)

Without the split of the dataset D on variable X, the entropy is

  H_C(∅) = − Σ_c P(C = c) ln P(C = c)

Information gain G_C from X:

  G_C(X) = H_C(∅) − E_{H_C}(X)

Example: contact lenses recommendation

The class variable is Lens:

  P(Lens = soft) = 5/24,  P(Lens = hard) = 4/24,  P(Lens = none) = 15/24

  H(∅) = −(5/24) ln(5/24) − (4/24) ln(4/24) − (15/24) ln(15/24) ≈ 0.92

For variable Ast (Astigmatism), taking 0 ln 0 = 0:

  P(Lens = soft | Ast = no) = 5/12,  P(Lens = hard | Ast = no) = 0/12,  P(Lens = none | Ast = no) = 7/12

  H(Ast = no) = −(5/12) ln(5/12) − (0/12) ln(0/12) − (7/12) ln(7/12) ≈ 0.68

Example (continued)

  P(Lens = soft | Ast = yes) = 0/12,  P(Lens = hard | Ast = yes) = 4/12,  P(Lens = none | Ast = yes) = 8/12

  H(Ast = yes) = −(0/12) ln(0/12) − (4/12) ln(4/12) − (8/12) ln(8/12) ≈ 0.64

  E_H(Ast) = ½ H(Ast = no) + ½ H(Ast = yes) = ½ (0.68 + 0.64) = 0.66

Information gain:

  G(Ast) = H(∅) − E_H(Ast) = 0.92 − 0.66 = 0.26

Example (continued)

For variable Tears:

  E_H(Tears) = ½ H(Tears = reduced) + ½ H(Tears = normal) ≈ ½ (0.0 + 1.1) = 0.55

Information gain:

  G(Tears) = H(∅) − E_H(Tears) = 0.92 − 0.55 = 0.37

Comparison: G(Tears) > G(Ast), so Tears is selected as the first splitting variable.
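These numbers are easy to reproduce; a short Python check (the entropy helper is my own name, the class counts come from the 24-instance table above):

from math import log

def entropy(counts):
    """Entropy -sum p ln p over a list of class counts (0 ln 0 taken as 0)."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)

h_root = entropy([5, 4, 15])                                      # H(no split) ~ 0.92
e_ast = 0.5 * entropy([5, 0, 7]) + 0.5 * entropy([0, 4, 8])       # E_H(Ast)   ~ 0.66
e_tears = 0.5 * entropy([0, 0, 12]) + 0.5 * entropy([5, 4, 3])    # E_H(Tears) ~ 0.54
print(h_root - e_ast, h_root - e_tears)  # G(Ast) ~ 0.26, G(Tears) ~ 0.38 (0.37 with the slide's rounding)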

Final remarks I

A node with too many branches causes the information-gain measure to break down.

Example: suppose that with each branch of a node a dataset with exactly one instance is associated. Then, if X has n values,

  E_{H_C}(X) = n · (1/n) · (−1 log 1 − 0 log 0) = 0

and hence G_C(X) = H_C(∅) − 0 = H_C(∅) attains its maximum.

Solution: the gain ratio R_C. Split information:

  H_X(∅) = − Σ_x P(X = x) ln P(X = x)

Gain ratio:

  R_C(X) = G_C(X) / H_X(∅)

Final remarks II

Variable selection is myopic: it does not look beyond the effects of its own values; a resulting decision tree is therefore likely to be suboptimal.

Decision trees may grow unwieldy and may need to be pruned (ID3 → C4.5):
  - subtree replacement
  - subtree raising

Decision trees can also be used for numerical variables: regression trees.
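A gain-ratio variant of the earlier entropy check, again as a self-contained sketch (gain_ratio and the argument layout are my own illustrative choices):

from math import log

def entropy(counts):
    """Entropy -sum p ln p over a list of counts (0 ln 0 taken as 0)."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)

def gain_ratio(class_counts_per_value, branch_sizes, root_counts):
    """Information gain divided by the split information of the branch sizes."""
    total = sum(branch_sizes)
    expected = sum((n / total) * entropy(c)
                   for n, c in zip(branch_sizes, class_counts_per_value))
    gain = entropy(root_counts) - expected
    split_info = entropy(branch_sizes)   # same formula, applied to the branch sizes
    return gain / split_info

# Tears on the contact-lens data: two branches of 12 instances each.
print(gain_ratio([[0, 0, 12], [5, 4, 3]], [12, 12], [5, 4, 15]))   # ~ 0.38 / ln 2 ~ 0.55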