Evaluation & Credibility Issues


Evaluation & Credibility Issues
What measure should we use? Accuracy might not be enough. How reliable are the predicted results? How much should we believe in what was learned? Error on the training data is not a good indicator of performance on future data: the classifier was computed from the very same training data, so any estimate based on that data will be optimistic.

Evaluation Questions
How to evaluate the performance of a model? How to obtain reliable estimates of performance? How to compare the relative performance among competing models? Given two equally performing models, which one should we prefer?

Metrics for Performance Evaluation
Focus on the predictive capability of a model. Confusion matrix:

              Predicted +    Predicted -
Actual +      f++ (TP)       f+- (FN)
Actual -      f-+ (FP)       f-- (TN)

Accuracy
Using the cells of the confusion matrix above (a = TP, b = FN, c = FP, d = TN), the most widely used metric is accuracy:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
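As a minimal sketch (not from the lecture), the cell counts f++, f+-, f-+, f-- and the accuracy could be tallied from label and prediction lists as follows; the function names and the encoding of the positive class as 1 are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Tally the four confusion-matrix cells for a binary problem.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fn, fp, tn

def accuracy(tp, fn, fp, tn):
    # Accuracy = (TP + TN) / (TP + TN + FP + FN)
    return (tp + tn) / (tp + tn + fp + fn)
```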

Misleading Accuracy
Consider a two-class problem: number of class 0 instances = 9990, number of class 1 instances = 10. Suppose a model predicts everything to be class 0. Its accuracy is 9990/10000 = 99.9%. This accuracy is misleading, because the model does not predict any class 1 instance.
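A quick, self-contained check of the slide's arithmetic (hypothetical data, assuming integer 0/1 class labels):

```python
y_true = [0] * 9990 + [1] * 10
y_pred = [0] * 10000          # a model that always predicts class 0
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(acc)                    # 0.999, despite never predicting class 1
```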

Cost Matrix

              Predicted +    Predicted -
Actual +      C(+|+)         C(-|+)
Actual -      C(+|-)         C(-|-)

C(i|j) is the cost of misclassifying a class j instance as class i.

Computing the Cost of Classification
Cost matrix:

              Predicted +    Predicted -
Actual +      -1             100
Actual -      1              0

First classifier:

              Predicted +    Predicted -
Actual +      150            40
Actual -      60             250

Accuracy = 80%, Cost = 3910

Second classifier:

              Predicted +    Predicted -
Actual +      250            45
Actual -      5              200

Accuracy = 90%, Cost = 4255

The second classifier is more accurate, yet it incurs a higher cost.
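The cost bookkeeping can be sketched as below (illustrative, not the lecture's code); the matrices are stored as dictionaries keyed by (actual, predicted) pairs, with the values taken from the tables above.

```python
def total_cost(conf, cost):
    # Sum count * cost over the four (actual, predicted) cells.
    return sum(conf[cell] * cost[cell] for cell in conf)

cost = {('+', '+'): -1, ('+', '-'): 100, ('-', '+'): 1, ('-', '-'): 0}
m1   = {('+', '+'): 150, ('+', '-'): 40, ('-', '+'): 60, ('-', '-'): 250}
m2   = {('+', '+'): 250, ('+', '-'): 45, ('-', '+'): 5,  ('-', '-'): 200}

print(total_cost(m1, cost))  # 150*(-1) + 40*100 + 60*1 + 250*0 = 3910
print(total_cost(m2, cost))  # 250*(-1) + 45*100 + 5*1  + 200*0 = 4255
```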

Accuracy
In the confusion matrix above:
True positive (TP) or f++: positive instances correctly predicted.
False negative (FN) or f+-: positive instances wrongly predicted as negative.
False positive (FP) or f-+: negative instances wrongly predicted as positive.
True negative (TN) or f--: negative instances correctly predicted.

Confusion Matrix
True positive rate (TPR): fraction of positive instances correctly predicted as positive.
False positive rate (FPR): fraction of negative instances wrongly predicted as positive.
False negative rate (FNR): fraction of positive instances wrongly predicted as negative.
True negative rate (TNR): fraction of negative instances correctly predicted as negative.

Confusion Matrix
True positive rate: TPR = TP / (TP + FN)
False positive rate: FPR = FP / (TN + FP)
False negative rate: FNR = FN / (TP + FN)
True negative rate: TNR = TN / (TN + FP)
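A small illustrative helper (names assumed, not from the slides) that turns the four counts into the four rates:

```python
def rates(tp, fn, fp, tn):
    tpr = tp / (tp + fn)   # true positive rate (sensitivity, recall)
    fpr = fp / (tn + fp)   # false positive rate
    fnr = fn / (tp + fn)   # false negative rate = 1 - TPR
    tnr = tn / (tn + fp)   # true negative rate = 1 - FPR
    return tpr, fpr, fnr, tnr
```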

Cost-Sensitive Measures
Precision: p = TP / (TP + FP)
Recall: r = TP / (TP + FN)
As precision increases, the number of false positives (FP) decreases; as recall increases, the number of false negatives (FN) decreases.

F1 Measure
F1 = 2rp / (r + p) = 2·TP / (2·TP + FP + FN)
Equivalently, F1 = 2 / (1/r + 1/p), the harmonic mean of precision and recall.
As the F1 measure increases, both false positives (FP) and false negatives (FN) decrease.

Fβ Measure
Fβ = (β² + 1)·r·p / (r + β²·p)
Both precision and recall are special cases of Fβ, obtained with β = 0 and β → ∞, respectively. Low values of β make Fβ closer to precision; high values make it closer to recall.

Precision and Recall
Precision: p = TP / (TP + FP)
Recall: r = TP / (TP + FN)
F1 = 2rp / (r + p) = 2·TP / (2·TP + FP + FN) = 2 / (1/r + 1/p)
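The precision/recall family can be sketched as one set of helpers (illustrative names; beta is the Fβ trade-off parameter from the previous slide):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(tp, fp, fn, beta=1.0):
    # F_beta = (beta^2 + 1) * p * r / (r + beta^2 * p)
    p, r = precision(tp, fp), recall(tp, fn)
    b2 = beta ** 2
    return (b2 + 1) * p * r / (r + b2 * p)

# beta = 1 gives the F1 measure 2*p*r / (p + r); beta = 0 reduces to
# precision, and very large beta approaches recall.
```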

Receiver Operating Characteristic (ROC)
Developed in the 1950s in signal detection theory to analyze noisy signals. The ROC curve plots the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis. The performance of each classifier is represented as a point on the ROC curve. Changing the classification threshold, the sample distribution, or the cost matrix changes the location of the point.

ROC Curve (figure)

Generating ROC Curves (figure)
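A possible sketch of the generation procedure, assuming each test instance has a real-valued score: sort by decreasing score and sweep the threshold down the ranking, emitting one (FPR, TPR) point per instance; the trapezoidal rule then gives the area under the curve. Both classes are assumed present, and ties between scores are not handled specially.

```python
def roc_points(scores, labels, positive=1):
    pos = sum(1 for y in labels if y == positive)
    neg = len(labels) - pos
    # sort instances by decreasing score; each prefix is one threshold setting
    ranked = sorted(zip(scores, labels), key=lambda sl: -sl[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, y in ranked:
        if y == positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))  # (FPR, TPR)
    return points

def auroc(points):
    # trapezoidal rule over consecutive (FPR, TPR) points
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```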

Dominating Classifiers in ROC Space (figure)

Area Under the ROC Curve (figure)

Precision-Recall Curves (figure)
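Under the same ranking-sweep assumption, precision-recall points can be generated analogously (an illustrative sketch, not the lecture's code):

```python
def pr_points(scores, labels, positive=1):
    pos = sum(1 for y in labels if y == positive)        # assumes pos > 0
    ranked = sorted(zip(scores, labels), key=lambda sl: -sl[0])
    points, tp, fp = [], 0, 0
    for _, y in ranked:
        if y == positive:
            tp += 1
        else:
            fp += 1
        points.append((tp / pos, tp / (tp + fp)))        # (recall, precision)
    return points
```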

ROC and PR Curves (figure)

ROC is Skew Insensitive (figure and threshold table)
The same ranking of predictions, labeled under two class distributions (Class I and Class II) with different skew, yields the same area: Curve I AUROC = 0.563, Curve II AUROC = 0.563.

Precision-Recall is Skew Sensitive (figure and threshold table)
Under the same two class distributions, the precision-recall areas differ: Curve I AUPR = 0.596, Curve II AUPR = 0.370.

Optimizing the AUROC vs. AUPR (figure)
Curve I: AUROC = 0.813, AUPR = 0.514. Curve II: AUROC = 0.875, AUPR = 0.038.

Lift Charts
Lift is a measure of the effectiveness of a predictive model, calculated as the ratio between the results obtained with and without the model. The greater the area between the lift curve and the baseline, the better the model.

Example: Direct Marketing
Mass mailout of promotional offers to 1,000,000 households. The proportion who normally respond is 0.1% (1,000). A data mining tool can identify a subset of 100,000 households for which the response rate is 0.4% (400). In marketing terminology, this increase in response rate (here a factor of 4) is known as the lift factor yielded by the model. The same tool may be able to identify 400,000 households for which the response rate is 0.2% (800), a lift factor of 2. The overall goal is to find subsets of test instances that have a high proportion of true positives.
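A small sketch of the lift arithmetic in this example, assuming lift is defined as the targeted subset's response rate divided by the overall response rate:

```python
def lift_factor(responders_in_subset, subset_size,
                total_responders, total_size):
    subset_rate = responders_in_subset / subset_size
    overall_rate = total_responders / total_size
    return subset_rate / overall_rate

print(lift_factor(400, 100_000, 1_000, 1_000_000))   # 0.4% / 0.1% = 4.0
print(lift_factor(800, 400_000, 1_000, 1_000_000))   # 0.2% / 0.1% = 2.0
```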

Example: Direct Marketing (lift chart figure)
