Big Data Analytics: Evaluating Classification Performance. April 2016, R. Bohn. Some overheads from Galit Shmueli and Peter Bruce (2010).



[Figure: several predictions compared to the actual value. Which is more accurate?]

Why evaluate a classifier? Multiple methods are available to classify or predict. For each method, multiple choices are available for settings. To choose the best model, we need to assess each model's performance: when you use it, how good will it be? How reliable are the results?

Misclassification error (for discrete classification problems). Error = classifying a record as belonging to one class when it belongs to another class. Error rate = percent of misclassified records out of the total records in the validation data.

Examples

Confusion Matrix. Classification confusion matrix (rows = actual class, columns = predicted class):

              Predicted 1   Predicted 0
  Actual 1        201            85
  Actual 0         25          2689

201 1's correctly classified as 1; 85 1's incorrectly classified as 0; 25 0's incorrectly classified as 1; 2689 0's correctly classified as 0. Note the asymmetry: many more 0's than 1's. Generally 1 = the interesting or rare case.

Naïve Rule. The naïve rule: classify all records as belonging to the most prevalent class. It is often used as a benchmark: we hope to do better than that. Exception: when the goal is to identify high-value but rare outcomes, we may prefer to do worse than the naïve rule in terms of overall error (see lift, later).

Evaluation Measures for Classification: the contingency table / confusion matrix. TP, FP, FN, TN are absolute counts of true positives, false positives, false negatives, and true negatives. Here x = inputs, f(x) = prediction, y = true label.

                 y = +1      y = -1
  f(x) = +1        TP          FP        O+ = TP + FP (positive predictions)
  f(x) = -1        FN          TN        O- = FN + TN (negative predictions)
                N+ = TP+FN  N- = FP+TN

N = sample size; N+ = FN + TP = number of positive examples; N- = FP + TN = number of negative examples.

(Source: Gunnar Rätsch, Memorial Sloan-Kettering Cancer Center, "Introduction to Kernels", MLSS 2012, Santa Cruz.)

Error Rate: Overall. Classification confusion matrix (rows = actual class, columns = predicted class):

              Predicted 1   Predicted 0
  Actual 1        201            85
  Actual 0         25          2689

Overall error rate = (25 + 85) / 3000 = 3.67%. Accuracy = 1 - error rate = (201 + 2689) / 3000 = 96.33%. This expands to an n x n matrix, where n = the number of alternative classes. With multiple classes, the error rate is (sum of misclassified records) / (total records).
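Checking the arithmetic with a few lines of code (the counts are taken from the confusion matrix above):

```python
# Overall error rate and accuracy from the confusion matrix above.
tp, fn = 201, 85   # actual 1's: correctly / incorrectly classified
fp, tn = 25, 2689  # actual 0's: incorrectly / correctly classified

n = tp + fn + fp + tn                 # 3000 records
error_rate = (fp + fn) / n            # (25 + 85) / 3000  = 0.0367
accuracy = (tp + tn) / n              # (201 + 2689) / 3000 = 0.9633

print(f"error rate = {error_rate:.2%}, accuracy = {accuracy:.2%}")
```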

Cutoff for classification. Most DM algorithms classify via a two-step process. For each record: 1. compute the probability of belonging to class 1; 2. compare it to the cutoff value and classify accordingly. The default cutoff value is often 0.50: if the probability is >= 0.50, classify as 1; if < 0.50, classify as 0. You may prefer a different cutoff value.
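A minimal sketch of the two-step rule, using hypothetical estimated probabilities:

```python
# Step 1 output: estimated P(record belongs to class 1), one value per record.
probs = [0.92, 0.61, 0.48, 0.07]      # hypothetical probabilities
cutoff = 0.50                         # default cutoff; may be tuned

# Step 2: compare each probability to the cutoff.
classes = [1 if p >= cutoff else 0 for p in probs]
print(classes)                        # [1, 1, 0, 0]
```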


Example: 24 cases classified. Estimated probabilities range from 0.004 to 0.996.

  Obs #   Est. Prob.   Actual      Obs #   Est. Prob.   Actual
    1       0.996        1          13       0.506        0
    2       0.988        1          14       0.471        0
    3       0.984        1          15       0.337        0
    4       0.980        1          16       0.218        1
    5       0.948        1          17       0.199        0
    6       0.889        1          18       0.149        0
    7       0.949        1          19       0.048        0
    8       0.762        0          20       0.038        0
    9       0.707        1          21       0.025        0
   10       0.681        0          22       0.022        0
   11       0.656        1          23       0.016        0
   12       0.622        0          24       0.004        0

Set cutoff = 50%: forecast 13 type-1 cases (observations 1-13 in the table above). Result: 4 false positives (observations 8, 10, 12, 13), 1 false negative (observation 16), 5/24 total errors.

Set cutoff = 70%: forecast 9 type-1 cases (observations 1-9). Result: 1 false positive (observation 8), 2 false negatives (observations 11 and 16), 3/24 total errors (we were lucky).
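The two cutoffs can be checked directly; the probabilities and actual classes below are copied from the 24-case table:

```python
# Recount false positives / false negatives at the two cutoffs discussed above.
probs = [0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.949, 0.762,
         0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
         0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004]
actual = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0,
          0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

for cutoff in (0.50, 0.70):
    pred = [1 if p >= cutoff else 0 for p in probs]
    fp = sum(1 for y, yhat in zip(actual, pred) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(actual, pred) if y == 1 and yhat == 0)
    print(f"cutoff {cutoff:.2f}: FP = {fp}, FN = {fn}, "
          f"total errors = {fp + fn}/24")
# Matches the slides: 5/24 errors at cutoff 0.50, 3/24 errors at cutoff 0.70.
```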

Which is better? A high cutoff (70%) gives more false negatives and fewer false positives; a lower cutoff (50%) gives the opposite. Which is better is not a statistics question; it is a business/application question. The total number of errors is usually NOT the right criterion. Which kind of error is more serious? Only rarely are the costs the same. (Related tools: statistical decision theory, Bayesian logic.)

When One Class Is More Important. In many cases it is more important to identify members of one class: tax fraud, credit default, response to a promotional offer, detecting electronic network intrusion, predicting delayed flights. In such cases, we are willing to tolerate greater overall error in return for better identifying the important class for further attention.

There is always a tradeoff: false positives vs. false negatives. How do we estimate these two types of errors? Answer: with our set-aside data, the validation data set.

Recall the contingency table / confusion matrix: TP, FP, FN, TN are the absolute counts of true positives, false positives, false negatives, and true negatives; N = sample size, N+ = FN + TP positive examples, N- = FP + TN negative examples, O+ = TP + FP positive predictions, O- = FN + TN negative predictions. (Repeated from the earlier slide; source: Gunnar Rätsch, MLSS 2012.)

Common performance measures. None of these is a hypothesis test (for example, "is p(positive) > 0?"). Reason: we don't care about the overall rate; we care about prediction of individual cases.

Evaluation Measures for Classification II: several commonly used performance measures.

  Name                                  Computation
  Accuracy                              ACC = (TP + TN) / N
  Error rate (1 - accuracy)             ERR = (FP + FN) / N
  Balanced error rate                   BER = (1/2) [ FN/(FN+TP) + FP/(FP+TN) ]
  Weighted relative accuracy            WRACC = TP/(TP+FN) - FP/(FP+TN)
  F1 score                              F1 = 2TP / (2TP + FP + FN)
  Cross-correlation coefficient         CC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
  Sensitivity / recall                  TPR = TP/N+ = TP / (TP + FN)
  Specificity                           TNR = TN/N- = TN / (TN + FP)
  1 - sensitivity                       FNR = FN/N+ = FN / (FN + TP)
  1 - specificity                       FPR = FP/N- = FP / (FP + TN)
  Positive predictive value / precision PPV = TP/O+ = TP / (TP + FP)
  False discovery rate                  FDR = FP/O+ = FP / (FP + TP)

(Source: Gunnar Rätsch, Memorial Sloan-Kettering Cancer Center, "Introduction to Kernels", MLSS 2012, Santa Cruz.)
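All of these measures reduce to arithmetic on the four counts. A minimal sketch of a few of them (the function name is illustrative; the example counts are the 3000-record confusion matrix from earlier):

```python
# A few of the measures above, computed from the four confusion-matrix counts.
def classification_measures(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    return {
        "accuracy":    (tp + tn) / n,
        "error_rate":  (fp + fn) / n,
        "sensitivity": tp / (tp + fn),           # recall / TPR
        "specificity": tn / (tn + fp),           # TNR
        "precision":   tp / (tp + fp),           # positive predictive value
        "fdr":         fp / (tp + fp),           # false discovery rate
        "f1":          2 * tp / (2 * tp + fp + fn),
    }

print(classification_measures(tp=201, fp=25, fn=85, tn=2689))
```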

Alternate Accuracy Measures. If C1 is the important class: sensitivity = % of the C1 class correctly classified; specificity = % of the C0 class correctly classified; false positive rate = % of predicted C1's that were not C1's; false negative rate = % of predicted C0's that were not C0's. (Note: these last two follow the textbook's convention of conditioning on the predicted class, which differs from the FPR = FP/(FP+TN) definition in the table above.)

Medical terms. In medical diagnostics, test sensitivity is the ability of a test to correctly identify those with the disease (true positive rate), whereas test specificity is the ability of the test to correctly identify those without the disease (true negative rate).

Significance vs. power (statistics).

False positive rate (α) = P(type I error) = 1 - specificity = FP / (FP + TN) = significance level. For simple hypotheses, this is the test's probability of incorrectly rejecting the null hypothesis.

False negative rate (β) = P(type II error) = 1 - sensitivity = FN / (TP + FN).

Power = sensitivity = 1 - β: the test's probability of correctly rejecting the null hypothesis; the complement of the false negative rate β.

Evaluation Measures for Classification III. [Left: ROC curve, true positive rate vs. false positive rate. Right: Precision-Recall curve, positive predictive value (precision) vs. true positive rate (recall). Both compare a proposed method against eponine, mcpromotor, and firstef.] The curves are obtained by varying the decision bias (cutoff) and recording TPR/FPR or PPV/TPR. For a bias-independent scalar evaluation measure, use the area under the ROC curve (auROC) or the area under the Precision-Recall curve (auPRC).
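For concreteness, a sketch of computing the two scalar summaries, assuming scikit-learn is available; the scores and labels reuse the 24-case example from earlier (they are not part of the original figure):

```python
# Bias-independent summaries: area under the ROC curve and under the PR curve.
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

y_true = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0,
          0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_score = [0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.949, 0.762,
           0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
           0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004]

auroc = roc_auc_score(y_true, y_score)                 # area under the ROC curve
precision, recall, _ = precision_recall_curve(y_true, y_score)
auprc = auc(recall, precision)                         # area under the PR curve
print(f"auROC = {auroc:.3f}, auPRC = {auprc:.3f}")
```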

[Figure: ROC curve, true positive rate vs. false positive rate.]

Lift and Decile Charts: Goal. Useful for assessing performance in terms of identifying the most important class. Helps evaluate, e.g., how many tax records to examine, how many loans to grant, how many customers to mail an offer to.

Lift and Decile Charts (cont.). Compare the performance of the DM model to "no model" (picking randomly). They measure the ability of the DM model to identify the important class, relative to its average prevalence. The charts give an explicit assessment of results over a large number of cutoffs.

Lift and Decile Charts: How to Use. Compare lift to the "no model" baseline. In the lift chart, compare the step function to the straight line; in the decile chart, compare to a ratio of 1.

[Lift chart (cumulative performance): y-axis = cumulative ownership, x-axis = # cases; one curve shows cumulative ownership when records are sorted using predicted values, the other shows cumulative ownership using the average.] After examining (e.g.) 10 cases (x-axis), 9 owners (y-axis) have been correctly identified.

[Decile-wise lift chart (training dataset): y-axis = decile mean / global mean, x-axis = deciles 1-10.] In the most probable (top) decile, the model is twice as likely to identify the important class, compared to its average prevalence.

Lift Charts: How to Compute. Using the model's classifications, sort records from most likely to least likely members of the important class. Compute lift: accumulate the correctly classified important-class records (y-axis) and compare to the number of total records examined (x-axis).
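A minimal sketch of this computation with hypothetical scores and outcomes (the variable names are illustrative):

```python
# Cumulative lift: sort by predicted probability, then accumulate the 1's found.
probs  = [0.95, 0.90, 0.80, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10, 0.05]
actual = [1,    1,    0,    1,    0,    1,    0,    0,    0,    0]

# Sort records from most to least likely members of the important class.
pairs = sorted(zip(probs, actual), key=lambda t: t[0], reverse=True)

cum_found, running = [], 0
for _, y in pairs:
    running += y                      # important-class records found so far
    cum_found.append(running)

# Lift chart: x = records examined (1..n), y = cum_found;
# "no model" baseline at record k is k * (overall fraction of 1's).
baseline = [(i + 1) * sum(actual) / len(actual) for i in range(len(actual))]
print(cum_found)
print(baseline)
```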

Lift vs. Decile Charts. Both embody the concept of moving down through the records, starting with the most probable. The decile chart does this in decile chunks of data; its y-axis shows the ratio of the decile mean to the overall mean. The lift chart shows continuous cumulative results; its y-axis shows the number of important-class records identified.
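A corresponding sketch for the decile-wise ratio, again with hypothetical data that is already sorted by predicted probability (most likely first):

```python
# Decile-wise lift: mean response in each decile divided by the global mean.
actual_sorted = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1,
                 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]   # 20 hypothetical records

n = len(actual_sorted)
global_mean = sum(actual_sorted) / n
decile_size = n // 10

decile_lift = []
for d in range(10):
    chunk = actual_sorted[d * decile_size:(d + 1) * decile_size]
    decile_lift.append((sum(chunk) / len(chunk)) / global_mean)

print([round(x, 2) for x in decile_lift])   # first entry is the top-decile lift
```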

Asymmetric Costs

Misclassification Costs May Differ. The cost of making a misclassification error may be higher for one class than the other(s). Looked at another way, the benefit of making a correct classification may be higher for one class than the other(s).

Example: Response to a Promotional Offer. Suppose we send an offer to 1000 people, with a 1% average response rate ("1" = response, "0" = nonresponse). The naïve rule (classify everyone as "0") has an error rate of 1% (seems good). With a lower cutoff, suppose we can correctly classify eight 1's as 1's. It comes at the cost of misclassifying twenty 0's as 1's and two 1's as 0's.

New Confusion Matrix

              Predict as 1   Predict as 0
  Actual 1          8              2
  Actual 0         20            970

Error rate = (2 + 20) / 1000 = 2.2% (higher than the naïve rate).

Introducing Costs & Benefits. Suppose the profit from a "1" is $10 and the cost of sending an offer is $1. Then, under the naïve rule, all are classified as "0", so no offers are sent: no cost, no profit. Under the DM predictions, 28 offers are sent: 8 respond with a profit of $10 each; 20 fail to respond, at a cost of $1 each; 972 receive nothing (no cost, no profit). Net profit = $60.
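The arithmetic, following the slide's accounting (profit counted for the 8 responders, mailing cost charged for the 20 mailed non-responders):

```python
# Net profit of the DM strategy in the promotional-offer example.
responders_mailed     = 8          # actual 1's predicted as 1
nonresponders_mailed  = 20         # actual 0's predicted as 1
profit_per_responder  = 10         # dollars
cost_per_offer        = 1          # dollars

net_profit = (responders_mailed * profit_per_responder
              - nonresponders_mailed * cost_per_offer)
print(net_profit)                  # 60, matching the slide's $60
```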

Profit Matrix

              Predict as 1   Predict as 0
  Actual 1        $80             $0
  Actual 0       ($20)            $0

Lift (again). Adding costs to the mix, as above, does not change the actual classifications. Better: use the lift curve and change the cutoff value for "1" to maximize profit.

Generalize to a Cost Ratio. Sometimes actual costs and benefits are hard to estimate. We need to express everything in terms of costs (i.e., the cost of misclassification per record). The goal is to minimize the average cost per record. A good practical substitute for individual costs is the ratio of misclassification costs (e.g., "misclassifying fraudulent firms is 5 times worse than misclassifying solvent firms").

Minimizing the Cost Ratio. q1 = cost of misclassifying an actual 1; q0 = cost of misclassifying an actual 0. Minimizing the cost ratio q1/q0 is identical to minimizing the average cost per record. Software* may provide an option for the user to specify a cost ratio. (*Currently unavailable in XLMiner.)

Note on Opportunity Costs. As we see, it is best to convert everything to costs, as opposed to a mix of costs and benefits. E.g., instead of "benefit from sale", refer to "opportunity cost of lost sale". This leads to the same decisions, but referring only to costs allows greater applicability.

Cost Matrix (including opportunity costs)

              Predict as 1   Predict as 0
  Actual 1         $8            $20
  Actual 0        $20             $0

Recall the original confusion matrix (profit from a "1" = $10, cost of sending an offer = $1):

              Predict as 1   Predict as 0
  Actual 1          8              2
  Actual 0         20            970

Adding Cost/Benefit to the Lift Curve. Sort records in descending probability of success. For each case, record the cost/benefit of the actual outcome, and also the cumulative cost/benefit. Plot all records: the x-axis is the index number (1 for the 1st case, n for the nth case), and the y-axis is the cumulative cost/benefit. Draw a reference line from the origin to y_n (y_n = total net benefit).

Multiple Classes. For m classes, the confusion matrix has m rows and m columns. Theoretically, there are m(m-1) misclassification costs, since any case could be misclassified in m-1 ways. In practice, this is often too many to work with. In a decision-making context, though, such complexity rarely arises: one class is usually of primary interest.

[Figure: negative slope to the reference curve.]

Lift Curve May Go Negative. If the total net benefit from all cases is negative, the reference line will have a negative slope. Nonetheless, the goal is still to use the cutoff to select the point where net benefit is at a maximum.

Oversampling and Asymmetric Costs. Caution: most real DM problems have asymmetric costs!

Rare Cases. Asymmetric costs/benefits typically go hand in hand with the presence of a rare but important class: a responder to a mailing, someone who commits fraud, a debt defaulter. Often we oversample the rare cases to give the model more information to work with, typically using 50% 1's and 50% 0's for training.

Example. The following graphs show optimal classification under three scenarios: (1) assuming equal costs of misclassification; (2) assuming that misclassifying "o" is five times the cost of misclassifying "x"; (3) an oversampling scheme allowing DM methods to incorporate asymmetric costs.

[Figure: classification with equal costs.]

[Figure: classification with unequal costs.]

[Figure: oversampling scheme. Oversample "o" to appropriately weight misclassification costs.]

An Oversampling Procedure (see the sketch below):
1. Separate the responders (rare) from the nonresponders.
2. Randomly assign half the responders to the training sample, plus an equal number of nonresponders.
3. The remaining responders go to the validation sample.
4. Add nonresponders to the validation data, to maintain the original ratio of responders to nonresponders.
5. Randomly take a test set (if needed) from the validation data.
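A minimal sketch of steps 1-4, assuming records are (features, label) pairs with label 1 for the rare responders; the function name and split details are illustrative:

```python
# Oversampling split: 50/50 training sample, validation at the original ratio.
import random

def oversample_split(records, seed=1):
    rng = random.Random(seed)
    responders    = [r for r in records if r[1] == 1]
    nonresponders = [r for r in records if r[1] == 0]
    rng.shuffle(responders)
    rng.shuffle(nonresponders)

    # Steps 1-2: half the responders plus an equal number of nonresponders -> training.
    half = len(responders) // 2
    train = responders[:half] + nonresponders[:half]

    # Steps 3-4: remaining responders, plus enough of the remaining nonresponders
    # to restore the original responder:nonresponder ratio -> validation.
    rest_resp = responders[half:]
    ratio = len(nonresponders) / max(len(responders), 1)
    n_valid_nonresp = int(round(len(rest_resp) * ratio))   # assumes enough remain
    valid = rest_resp + nonresponders[half:half + n_valid_nonresp]
    return train, valid
```

Step 5 (holding out a test set) would then draw randomly from the validation sample.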

Classification Using Triage. Take into account a gray area in making classification decisions. Instead of classifying as C1 or C0, we classify as C1, C0, or "can't say". The third category might receive special human review.

Evaluating Predictive Performance

Measuring Predictive Error. This is not the same as goodness-of-fit: we want to know how well the model predicts new data, not how well it fits the data it was trained with. The key component of most measures is the difference between the actual y and the predicted y (the "error").

Some Measures of Error. MAE or MAD (mean absolute error/deviation): gives an idea of the magnitude of errors. Average error: gives an idea of systematic over- or under-prediction. MAPE: mean absolute percentage error. RMSE (root mean squared error): square the errors, find their average, take the square root. Total SSE: total sum of squared errors.
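A short sketch computing these measures from hypothetical actual and predicted values:

```python
# Error measures for numeric prediction (hypothetical numbers).
from math import sqrt

actual    = [10.0, 12.5, 9.0, 14.0, 11.0]
predicted = [11.0, 12.0, 8.5, 15.5, 10.0]

errors = [a - p for a, p in zip(actual, predicted)]
n = len(errors)

mae  = sum(abs(e) for e in errors) / n                            # mean absolute error (MAD)
avg  = sum(errors) / n                                            # average error (bias)
mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n   # mean absolute % error
rmse = sqrt(sum(e * e for e in errors) / n)                       # root mean squared error
sse  = sum(e * e for e in errors)                                 # total sum of squared errors

print(mae, avg, mape, rmse, sse)
```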

Lift Chart for Predictive Error. Similar to the lift chart for classification, except the y-axis is the cumulative value of the numeric target variable (e.g., revenue), instead of the cumulative count of responses.

[Figure: lift chart example for spending.]

Summary. Evaluation metrics are important for comparing across DM models, for choosing the right configuration of a specific DM model, and for comparing to the baseline. Major metrics: confusion matrix, error rate, predictive error. Other metrics apply when one class is more important or costs are asymmetric. When the important class is rare, use oversampling. In all cases, metrics are computed from the validation data.