Performance Evaluation

David S. Rosenberg, Bloomberg ML EDU
October 26, 2017

Baseline Models

When is your prediction function good?

Compare to previous models, if they exist. Is it good enough for business purposes? It is also helpful to have some simple baseline models for comparison: to make sure you're doing significantly better than trivial models, and to make sure the problem you're working on even has a useful target function.

Zero-Information Prediction Function (Classification)

For classification, let y_mode be the most frequently occurring class in the training data. The prediction function that always predicts y_mode is called the zero-information (or no-information) prediction function: "no information" because it uses no information from the input x.

Zero-Information Prediction Function (Regression)

What is the right zero-information prediction function for square loss? The mean of the training labels (see earlier homework). What is the right zero-information prediction function for absolute loss? The median of the training labels (see earlier homework).
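
To make this concrete, here is a minimal Python sketch (not from the slides) of both zero-information baselines; the function names and the toy labels are purely illustrative.

```python
from collections import Counter
from statistics import mean, median

def zero_info_classifier(y_train):
    """Return a function that always predicts the most frequent training class."""
    y_mode = Counter(y_train).most_common(1)[0][0]
    return lambda x: y_mode  # ignores the input x entirely

def zero_info_regressor(y_train, loss="square"):
    """Constant prediction minimizing square loss (mean) or absolute loss (median)."""
    constant = mean(y_train) if loss == "square" else median(y_train)
    return lambda x: constant

# Example usage with made-up labels
predict_class = zero_info_classifier(["spam", "ham", "ham", "ham"])
print(predict_class("any input"))   # -> "ham"
predict_value = zero_info_regressor([1.0, 2.0, 10.0], loss="absolute")
print(predict_value("any input"))   # -> 2.0 (the median)
```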

Single-Feature Prediction Functions

Choose a basic ML algorithm (e.g. linear regression or decision stumps) and build a set of prediction functions with it, each using only one feature.
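
A hedged sketch of this idea, using the linear-regression option: fit a one-variable least-squares line on each feature separately and rank features by training MSE. The helper names and the list-of-rows data layout are assumptions for illustration; in practice you would compare the single-feature models on validation data.

```python
def fit_one_feature(xs, ys):
    """Least-squares line y = intercept + slope * x on a single feature."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

def single_feature_baselines(X, y):
    """One prediction function per feature, with its training MSE, best first."""
    results = []
    for j in range(len(X[0])):
        xs = [row[j] for row in X]
        f = fit_one_feature(xs, y)
        mse = sum((f(x) - t) ** 2 for x, t in zip(xs, y)) / len(y)
        results.append((j, f, mse))
    return sorted(results, key=lambda r: r[2])

# Tiny made-up example: feature 0 is informative, feature 1 is not.
X = [[1.0, 3.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
y = [2.1, 3.9, 6.2, 8.1]
for j, f, mse in single_feature_baselines(X, y):
    print(f"feature {j}: training MSE {mse:.3f}")
```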

Regularized Linear Model

Whatever fancy model you are using (gradient boosting, neural networks, etc.), always spend some time building a linear baseline model. Build a regularized linear model: lasso / ridge / elastic-net regression, or l1- and/or l2-regularized logistic regression or SVM. If your fancier model isn't beating the linear one, perhaps something is wrong with the fancier model (e.g. hyperparameter settings), or you don't have enough data to beat the simpler model. Prefer simpler models if performance is the same: they are usually cheaper to train and easier to deploy.
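
For instance, a linear-baseline sketch along these lines, assuming scikit-learn and NumPy are available; the tiny synthetic dataset and the hyperparameter values are placeholders for illustration, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge, Lasso

# Synthetic data (first two features carry signal, the rest are noise).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y_class = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)
y_reg = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.3, size=200)

# Classification baseline: l2-regularized logistic regression.
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X[:150], y_class[:150])
print("logistic baseline accuracy:", clf.score(X[150:], y_class[150:]))

# Regression baselines: ridge (l2) and lasso (l1).
for model in (Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X[:150], y_reg[:150])
    print(type(model).__name__, "validation R^2:", model.score(X[150:], y_reg[150:]))
```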

Oracle Models

It is often helpful to get an upper bound on achievable performance. What is the best prediction function you can get, looking at your validation/test data? Its performance will estimate the Bayes risk (i.e. the optimal error rate). This won't always be 0 (why?). Alternatively, using the same model class as your ML model, fit to the validation/test data without regularization: the resulting performance tells us the limit of our model class, even with infinite training data, and gives an estimate of the approximation error of our hypothesis space.

Describing Classifier Performance

Confusion Matrix

A confusion matrix summarizes the results of a binary classification problem:

                    Actual
  Predicted     Class 0   Class 1
  Class 0          a         b
  Class 1          c         d

a is the number of examples of Class 0 that the classifier predicted (correctly) as Class 0.
b is the number of examples of Class 1 that the classifier predicted (incorrectly) as Class 0.
c is the number of examples of Class 0 that the classifier predicted (incorrectly) as Class 1.
d is the number of examples of Class 1 that the classifier predicted (correctly) as Class 1.

Performance Statistics

Many performance statistics are defined in terms of the entries a, b, c, d of the confusion matrix above.

Accuracy is the fraction of correct predictions: (a + d) / (a + b + c + d).
Error rate is the fraction of incorrect predictions: (b + c) / (a + b + c + d).

Performance Statistics

We can also talk about the accuracy of different subgroups of examples:

Accuracy for examples of class 0: a / (a + c).
Accuracy for examples predicted to have class 0: a / (a + b).
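
A small sketch of these statistics in Python, using the a, b, c, d notation from the slides; the toy label lists are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Counts a, b, c, d for a binary (0/1) problem, rows = predicted, cols = actual."""
    a = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 0)
    b = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 1)
    c = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    d = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    return a, b, c, d

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]
a, b, c, d = confusion_counts(y_true, y_pred)
n = a + b + c + d
print("accuracy:", (a + d) / n)               # fraction of correct predictions
print("error rate:", (b + c) / n)             # fraction of incorrect predictions
print("accuracy on actual class 0:", a / (a + c))
print("accuracy on predicted class 0:", a / (a + b))
```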

Issue with Accuracy and Error Rate

Consider a no-information classifier that achieves the following:

                    Actual
  Predicted     Class 0   Class 1
  Class 0        10000        9
  Class 1            0        0

Accuracy is 99.9% and the error rate is 0.09%. Two lessons:

1. Accuracy numbers are meaningless without knowing the no-information rate, or base rate.
2. Accuracy alone doesn't capture what's going on (0% success on class 1).

Positive and Negative Classes

So far, no class label has had any special status. In many contexts, it's very natural to identify a positive class and a negative class:

pregnancy test (positive = you're pregnant)
radar system (positive = threat detected)
searching for documents about bitcoin (positive = document is about bitcoin)
statistical hypothesis testing (positive = reject the null hypothesis)

FP, FN, TP, TN

Let's denote the positive class by + and the negative class by −:

                    Actual
  Predicted     Class +   Class −
  Class +          TP        FP
  Class −          FN        TN

TP is the number of true positives: predicted correctly as Class +.
FP is the number of false positives: predicted incorrectly as Class + (i.e. true class −).
TN is the number of true negatives: predicted correctly as Class −.
FN is the number of false negatives: predicted incorrectly as Class − (i.e. true class +).

Precision and Recall

With the positive class + and the negative class − as before:

                    Actual
  Predicted     Class +   Class −
  Class +          TP        FP
  Class −          FN        TN

The precision is the accuracy of the positive predictions: TP / (TP + FP). High precision means a low false alarm rate (if you test positive, you're probably positive).
The recall is the accuracy on the positive class: TP / (TP + FN). High recall means you're not missing many positives.

Information Retrieval

Consider a database of 100,000 documents. A query for "bitcoin" returns 200 documents; 100 of them are actually about bitcoin, and 50 documents about bitcoin were not returned.

                    Actual
  Predicted     Class +   Class −
  Class +          100       100
  Class −           50    99,750

Precision and Recall

Results from the bitcoin query:

                    Actual
  Predicted     Class +   Class −
  Class +          100       100
  Class −           50    99,750

The precision is the accuracy of the + predictions: TP / (TP + FP) = 100/200 = 50%. So 50% of the documents offered as relevant are actually relevant.
The recall is the accuracy on the positive class: TP / (TP + FN) = 100/(100 + 50) ≈ 67%. So 67% of the relevant documents were found (or "recalled").

What's an easy way to get 100% recall?
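
Here is a minimal sketch that computes precision and recall and reproduces the bitcoin numbers above; the "+"/"-" label encoding is just for illustration.

```python
def precision_recall(y_true, y_pred, positive="+"):
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Bitcoin retrieval example: 200 documents returned, 100 of them relevant,
# and 50 relevant documents not returned.
y_true = ["+"] * 100 + ["-"] * 100 + ["+"] * 50
y_pred = ["+"] * 200 + ["-"] * 50
print(precision_recall(y_true, y_pred))  # -> (0.5, 0.666...)
```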

F1 score

We really want both high precision and high recall, but to choose a best model we need a single-number performance summary. The F-measure, or F1 score, is the harmonic mean of precision and recall. It ranges from 0 to 1:

F1 = 2 / (1/recall + 1/precision) = 2 · precision · recall / (precision + recall)

     Precision   Recall   F1
  1     0.01      0.99   0.02
  2     0.20      0.80   0.32
  3     0.40      0.90   0.55
  4     0.60      0.62   0.61
  5     0.90      0.95   0.92

Fβ score

Sometimes you want to weight precision or recall more highly. The Fβ score, for β ≥ 0, is

Fβ = (1 + β²) · precision · recall / (β² · precision + recall)

     Precision   Recall   F1    F0.5    F2
  1     0.01      0.99   0.02   0.01   0.05
  2     0.20      0.80   0.32   0.24   0.50
  3     0.40      0.90   0.55   0.45   0.72
  4     0.60      0.62   0.61   0.60   0.62
  5     0.90      0.95   0.92   0.91   0.94
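
A short sketch of the F1 and Fβ formulas that reproduces the rows of the table above; the helper name f_beta is an invention for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); F_1 is the harmonic mean."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

for p, r in [(0.01, 0.99), (0.20, 0.80), (0.40, 0.90), (0.60, 0.62), (0.90, 0.95)]:
    print(p, r,
          round(f_beta(p, r, 1.0), 2),   # F1
          round(f_beta(p, r, 0.5), 2),   # F0.5 weighs precision more
          round(f_beta(p, r, 2.0), 2))   # F2 weighs recall more
```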

TPR, FNR, FPR, TNR

                    Actual
  Predicted     Class +   Class −
  Class +          TP        FP
  Class −          FN        TN

True positive rate is the accuracy on the positive class: TP / (FN + TP); same as recall, also called sensitivity.
False negative rate is the error rate on the positive class: FN / (FN + TP); the "miss rate".
False positive rate is the error rate on the negative class: FP / (FP + TN); also called fall-out or false alarm rate.
True negative rate is the accuracy on the negative class: TN / (FP + TN); the "specificity".
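
A minimal sketch computing the four rates from TP, FP, FN, TN counts, using the counts from the bitcoin example above as input.

```python
def rates(tp, fp, fn, tn):
    return {
        "TPR (sensitivity / recall)": tp / (tp + fn),
        "FNR (miss rate)":            fn / (tp + fn),
        "FPR (fall-out)":             fp / (fp + tn),
        "TNR (specificity)":          tn / (fp + tn),
    }

# Counts from the bitcoin example: TP=100, FP=100, FN=50, TN=99,750.
for name, value in rates(100, 100, 50, 99_750).items():
    print(f"{name}: {value:.4f}")
```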

Medical Diagnostic Test: Sensitivity and Specificity

Sensitivity is another name for TPR and recall: what fraction of people with the disease do we identify as having the disease? How sensitive is our test to indicators of the disease?
Specificity is another name for TNR: what fraction of people without the disease do we identify as being without the disease? High specificity means few false alarms.
In medical diagnosis, we want both sensitivity and specificity to be high.

Statistical Hypothesis Testing

In a statistical hypothesis test, there are two possible actions: reject the null hypothesis (predict +), or don't reject the null hypothesis (predict −). The two possible error types are called Type 1 and Type 2 error:

                    Actual
  Predicted     Class +         Class −
  Class +       (correct)       Type 1 Error
  Class −       Type 2 Error    (correct)

Thresholding Classification Score Functions

The Classification Problem

Action space A = R. Output space Y = {−1, 1}. We have a real-valued prediction function f : X → R, called the score function. The convention was:

f(x) > 0  ⇒  predict 1
f(x) < 0  ⇒  predict −1

Example: Scores, Predictions, and Labels

  ID   Score   Predicted Class   True Class
   1   -4.80          -              -
   2   -4.43          -              -
   3   -2.09          -              -
   4   -1.30          -              -
   5   -0.53          -              +
   6   -0.30          -              +
   7    0.49          +              -
   8    0.98          +              -
   9    2.25          +              +
  10    3.37          +              +
  11    4.03          +              +
  12    4.90          +              +

Performance measures (threshold at 0):
Error rate = 4/12 ≈ 0.33, Precision = 4/6 ≈ 0.67, Recall = 4/6 ≈ 0.67, F1 ≈ 0.67.

Now predict + iff Score > 2:
Error rate = 2/12 ≈ 0.17, Precision = 4/4 = 1.0, Recall = 4/6 ≈ 0.67, F1 = 0.8.

Now predict + iff Score > -1:
Error rate = 2/12 ≈ 0.17, Precision = 6/8 = 0.75, Recall = 6/6 = 1.0, F1 ≈ 0.86.

Thresholding the Score Function

Generally, different thresholds on the score function lead to different confusion matrices, and hence different performance metrics. One should choose the threshold that optimizes the business objective. Examples:

Maximize F1 (or F0.2 or F2.0, etc.).
Maximize precision, such that recall > 0.8.
Maximize sensitivity, such that specificity > 0.7.
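
The sketch below reproduces the worked example from the previous slide: it sweeps the threshold over the 12 scores and prints the error rate, precision, recall, and F1 at thresholds 0, 2, and -1.

```python
scores = [-4.80, -4.43, -2.09, -1.30, -0.53, -0.30, 0.49, 0.98, 2.25, 3.37, 4.03, 4.90]
labels = ["-", "-", "-", "-", "+", "+", "-", "-", "+", "+", "+", "+"]

def metrics_at_threshold(threshold):
    preds = ["+" if s > threshold else "-" for s in scores]
    tp = sum(p == "+" and t == "+" for p, t in zip(preds, labels))
    fp = sum(p == "+" and t == "-" for p, t in zip(preds, labels))
    fn = sum(p == "-" and t == "+" for p, t in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return (fp + fn) / len(scores), precision, recall, f1

for thr in (0.0, 2.0, -1.0):
    err, prec, rec, f1 = metrics_at_threshold(thr)
    print(f"threshold {thr:+.1f}: error={err:.2f} precision={prec:.2f} "
          f"recall={rec:.2f} F1={f1:.2f}")
```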

The Performance Curves

Precision-Recall as a Function of Threshold

Using the scores and labels from the table above:

What happens to recall as we decrease the threshold from +∞ to −∞? Recall increases (or at least never decreases).
What happens to precision as we decrease the threshold from +∞ to −∞? If the score captures confidence, we expect a higher threshold to give higher precision, but there are no guarantees in general.

Precision-Recall Curve

Which threshold should you choose? It depends on your preference between precision and recall. (Figure: an example precision-recall curve, generated with the ROCR package.)
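
One way to trace such a curve, sketched in plain Python (the ROCR figure itself is not reproduced here): sort by score, lower the threshold one example at a time, and record (precision, recall) at each step. Tie handling is deliberately naive in this sketch.

```python
def pr_curve(scores, labels, positive="+"):
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = labels.count(positive)
    tp = fp = 0
    points = []
    for i in order:                                  # lower the threshold step by step
        if labels[i] == positive:
            tp += 1
        else:
            fp += 1
        points.append((tp / (tp + fp), tp / n_pos))  # (precision, recall)
    return points

scores = [-4.80, -4.43, -2.09, -1.30, -0.53, -0.30, 0.49, 0.98, 2.25, 3.37, 4.03, 4.90]
labels = ["-", "-", "-", "-", "+", "+", "-", "-", "+", "+", "+", "+"]
for precision, recall in pr_curve(scores, labels):
    print(f"precision={precision:.2f} recall={recall:.2f}")
```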

Receiver Operating Characteristic (ROC) Curve

Using the scores and labels from the table above, recall FPR and TPR:

FPR = FP / (number of negative examples)
TPR = TP / (number of positive examples)

As we decrease the threshold from +∞ to −∞, the number of predicted positives increases, some correct and some incorrect, so both FP and TP increase. The ROC curve charts TPR vs. FPR as we vary the threshold.

Receiver Operating Characteristic (ROC) Curve

An ideal ROC curve would go straight up along the left side of the chart. (Figure: an example ROC curve, generated with the ROCR package.)

Comparing ROC Curves

Here we have ROC curves for three score functions. For different FPRs, different score functions give better TPRs; no score function dominates another at every FPR. Can we come up with an overall performance measure for a score function? (Figure by bor, from Wikimedia Commons.)

Area Under the ROC Curve

AUC ROC is the area under the ROC curve, often just referred to as AUC. It is a single number commonly used to summarize classifier performance. Much more can be said about AUC and ROC curves. People also consider AUC PR, the area under the precision-recall curve.
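
As an illustration, here is a plain-Python sketch that sweeps the threshold to build the ROC points and computes AUC with the trapezoidal rule; it is not the ROCR implementation, just a minimal version run on the 12-example data above.

```python
def roc_curve_and_auc(scores, labels, positive="+"):
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(1 for l in labels if l == positive)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]                      # (FPR, TPR), starting at the origin
    for i in order:                            # lower the threshold one example at a time
        if labels[i] == positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    auc = sum((x2 - x1) * (y1 + y2) / 2        # trapezoidal rule over the FPR axis
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auc

scores = [-4.80, -4.43, -2.09, -1.30, -0.53, -0.30, 0.49, 0.98, 2.25, 3.37, 4.03, 4.90]
labels = ["-", "-", "-", "-", "+", "+", "-", "-", "+", "+", "+", "+"]
points, auc = roc_curve_and_auc(scores, labels)
print("AUC:", auc)
```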

Recall: The Cell Phone Churn Problem

Cell phone customers often switch carriers; this is called churn. It is often cheaper to retain a customer than to acquire a new one, and you can try to retain a customer by giving a promotion, such as a discount. If you give a discount to somebody who was going to churn, you probably saved money. If you give a discount to somebody who was NOT going to churn, you wasted money.

Now we've trained a classifier to predict churn. We need to choose a threshold on our score function, and we will give a promotion to everybody with a score above the threshold.

Lift Curves for Predicting Churners

The x value is the number of users targeted; the y value is the number of churners in the target group. The baseline is for a random score function. Each curve is a lift curve: it shows the increase in successes from the model over the baseline.
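
A minimal sketch of how such a curve can be computed, assuming we rank customers by score and compare the cumulative churners found against random targeting; the scores and churn labels below are made up for illustration.

```python
def lift_curve(scores, churned):
    """For each k, churners found among the top-k scored customers vs. a random baseline."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_churners = sum(churned)
    n = len(scores)
    curve = []
    hits = 0
    for k, i in enumerate(order, start=1):
        hits += churned[i]                       # churners found among the top-k
        baseline = total_churners * k / n        # expected count under a random score
        curve.append((k, hits, baseline))
    return curve

# Illustrative data: 1 = customer churned, 0 = did not churn.
scores  = [0.9, 0.8, 0.75, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
churned = [1,   1,   0,    1,   0,   0,   1,   0,   0,   0]
for k, hits, baseline in lift_curve(scores, churned):
    print(f"targeted={k:2d}  churners found={hits}  random baseline={baseline:.1f}")
```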