Data Privacy in Biomedicine. Lecture 11b: Performance Measures for System Evaluation

Data Privacy in Biomedicine Lecture 11b: Performance Measures for System Evaluation Bradley Malin, PhD (b.malin@vanderbilt.edu) Professor of Biomedical Informatics, Biostatistics, & Computer Science Vanderbilt University February 14, 2018

Classification Imagine you have a dataset of tokens D You believe they can be classified into two different classes ID: tokens that correspond to PHI Non-ID: tokens that correspond to non-PHI 2018 Bradley Malin 2

So Try to Classify You have written several competing methods for classification Now you want to know which one is better How can you do this in a quantitative manner? 2018 Bradley Malin 3

Notation and Models Dataset of Tokens D: d1, d2, d3, d4 2018 Bradley Malin 4

Notation and Models Known Underlying Truth T = {t1, ..., tn}: t1 = PHI, t2 = not PHI, t3 = PHI, t4 = not PHI for tokens d1, d2, d3, d4 2018 Bradley Malin 5

Algorithm A Predictions Predictions A = {a1, ..., an}: a1 = PHI, a2 = not PHI, a3 = PHI, a4 = PHI (truth: t1 = PHI, t2 = not PHI, t3 = PHI, t4 = not PHI) 2018 Bradley Malin 6

Algorithm A Predictions Correct Predictions: a1, a2, and a3 agree with the truth (tokens d1, d2, d3) 2018 Bradley Malin 7

Algorithm A Predictions Incorrect Predictions: a4 = PHI disagrees with t4 = not PHI (token d4) 2018 Bradley Malin 8

Algorithm B Predictions Predictions B = {b1, ..., bn}: b1 = not PHI, b2 = not PHI, b3 = PHI, b4 = PHI (truth: t1 = PHI, t2 = not PHI, t3 = PHI, t4 = not PHI) 2018 Bradley Malin 9

Algorithm B Predictions Correct Predictions: b2 and b3 agree with the truth (tokens d2, d3) 2018 Bradley Malin 10

Algorithm B Predictions Incorrect Predictions: b1 = not PHI disagrees with t1 = PHI, and b4 = PHI disagrees with t4 = not PHI (tokens d1, d4) 2018 Bradley Malin 11

Enter the Contingency Table
                              MODEL PREDICTED: It's NOT PHI   MODEL PREDICTED: It's PHI
GOLD STANDARD: Was NOT PHI                  A                             B
GOLD STANDARD: Was PHI                      C                             D
2018 Bradley Malin 12

Contingency Terms Cell A (truth: no event, predicted: no event) is a TRUE NEGATIVE; cell D (truth: event, predicted: event) is a TRUE POSITIVE 2018 Bradley Malin 13

Some More Terms Cell B (truth: no event, predicted: event) is a FALSE POSITIVE (Type 1 Error); cell C (truth: event, predicted: no event) is a FALSE NEGATIVE (Type 2 Error) 2018 Bradley Malin 14

Accuracy What does this mean? What is the difference between accuracy and an accurate prediction? Contingency Table Interpretation: Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives) Is this a good measure? (Why or Why Not?) 2018 Bradley Malin 15
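
To make the cell definitions concrete, here is a minimal Python sketch (my own illustration, not from the slides) that tallies the contingency-table cells for the four-token example and computes accuracy:

```python
# Minimal sketch: contingency-table cells and accuracy for the 4-token example,
# treating "PHI" as the positive class.
truth     = ["PHI", "not PHI", "PHI", "not PHI"]   # t1..t4
predicted = ["PHI", "not PHI", "PHI", "PHI"]       # Algorithm A's a1..a4

tp = sum(t == "PHI"     and p == "PHI"     for t, p in zip(truth, predicted))
tn = sum(t == "not PHI" and p == "not PHI" for t, p in zip(truth, predicted))
fp = sum(t == "not PHI" and p == "PHI"     for t, p in zip(truth, predicted))
fn = sum(t == "PHI"     and p == "not PHI" for t, p in zip(truth, predicted))

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, accuracy)   # 2 1 1 0 0.75
```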

Algorithm Comparison
Algorithm A              PREDICTED: NOT PERSON   PREDICTED: PERSON
TRUTH: NOT PERSON                   1                    1
TRUTH: PERSON                       0                    2

Algorithm B              PREDICTED: NOT PERSON   PREDICTED: PERSON
TRUTH: NOT PERSON                   1                    1
TRUTH: PERSON                       1                    1

Accuracy of Algorithm A: 3 / 4 = 0.75; Accuracy of Algorithm B: 2 / 4 = 0.5
2018 Bradley Malin 16

Note on Discrete Classes TRADITIONALLY: show a contingency table when reporting the predictions of a model. BUT probabilistic models do not provide discrete counts for the matrix cells! IN OTHER WORDS: an algorithm does not necessarily report which documents were correctly predicted; INSTEAD it reports the probability that the output takes a certain value (e.g., PHI or not PHI) 2018 Bradley Malin 17

What to Do if Classification is Probabilistic? Imagine you have 2 different probabilistic classification models e.g. Algorithm A vs. Algorithm B How do you know which one is better? How do you communicate your belief? Can you provide quantitative evidence beyond a gut feeling and subjective interpretation? 2018 Bradley Malin 18

Which Score Should Be The Threshold? [Histogram of classifier scores (roughly -4 to 6) for the NOT PHI / NOT Person class and the PHI / Person class, with frequency on the y-axis] 2018 Bradley Malin 19

Consider Precision-Recall First, order your documents by score for the positive class E.g. PHI scores from Algorithm A (the higher the score, the higher the confidence): d2 (t2 = not PHI, score 0.2), d1 (t1 = PHI, score 0.5), d3 (t3 = PHI, score 0.7), d4 (t4 = not PHI, score 0.9) 2018 Bradley Malin 20

Recall Now, choose a threshold score and make the classification Ex: Threshold = 0.4 Everything below the threshold (d2) is classified as NOT PHI; everything above it (d1, d3, d4) is classified as PHI 2018 Bradley Malin 21

Recall Recall is the fraction of the documents you wanted (the true PHI documents) that you classified as PHI With the threshold at 0.4, Recall = 2/2 = 1.0 (both d1 and d3 fall above the threshold) 2018 Bradley Malin 22

Precision Precision is the fraction of the documents you classified as PHI that are truly labeled PHI With the threshold at 0.4, Precision = 2/3 ≈ 0.67 (d1, d3, and d4 are classified as PHI, but only d1 and d3 are PHI) 2018 Bradley Malin 23
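
A small sketch (my own, not from the slides) of the same recall and precision calculation at the 0.4 threshold:

```python
# Sketch: recall and precision at a single threshold for the scored tokens above.
scores = {"d2": 0.2, "d1": 0.5, "d3": 0.7, "d4": 0.9}                 # PHI scores
truth  = {"d1": "PHI", "d2": "not PHI", "d3": "PHI", "d4": "not PHI"}
threshold = 0.4

classified_phi = {d for d, s in scores.items() if s >= threshold}     # d1, d3, d4
actual_phi     = {d for d, t in truth.items() if t == "PHI"}          # d1, d3

recall    = len(classified_phi & actual_phi) / len(actual_phi)        # 2/2 = 1.0
precision = len(classified_phi & actual_phi) / len(classified_phi)    # 2/3 ≈ 0.67
print(recall, precision)
```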

Precision & Recall à la Venn Set of Documents D The set of relevant documents in D (i.e., the PHI class) The set of documents classified as relevant (i.e., PHI) 2018 Bradley Malin 24

Precision & Recall à la Venn Let X be the relevant documents not classified as PHI, Y the documents classified as PHI that are not relevant, and Z the overlap (relevant documents classified as PHI). Then RECALL = Z / (X + Z) and PRECISION = Z / (Z + Y) 2018 Bradley Malin 25

Precision Recall Curve The previous example showed Recall and Precision for a single threshold Now calculate the scores at thresholds across the range of scores Plot the resulting scores as <recall, precision> coordinate points Both are usually in the range [0,1] The standard 11-point curve plots precision at 11 evenly spaced recall levels (0.0, 0.1, ..., 1.0) 2018 Bradley Malin 26
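
As a sketch of the threshold sweep (my construction; the slides describe only the procedure), the following collects <recall, precision> points over a grid of thresholds:

```python
# Sketch: sweep thresholds over the PHI scores and collect (recall, precision) points.
def pr_points(labels, scores, thresholds):
    n_pos = sum(labels)
    points = []
    for thr in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= thr)
        if tp + fp == 0:
            continue                                   # precision undefined: nothing classified PHI
        points.append((tp / n_pos, tp / (tp + fp)))    # (recall, precision)
    return points

labels = [0, 1, 1, 0]                                  # d2, d1, d3, d4 (1 = PHI)
scores = [0.2, 0.5, 0.7, 0.9]
print(pr_points(labels, scores, [i / 10 for i in range(11)]))
```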

P-R Curve Example Four documents are not enough for a P-R curve Imagine you had 200 documents (100 PHI and 100 NOT PHI) [P-R curve for Algorithm A: precision vs. recall, both axes from 0 to 1] 2018 Bradley Malin 27

P-R Curve Example To compare algorithms, consider plotting both P-R curves in the same graph Use critical points or thresholds to determine which algorithm is better in particular scenarios From a general perspective, the area under the curve, or AUC, provides a measure of how good a classification method is. 2018 Bradley Malin 28

Comparative Performance [P-R curves for Algorithm A and Algorithm B plotted on the same axes: precision vs. recall, both from 0 to 1] 2018 Bradley Malin 29

ROC Curves Receiver operating characteristic Summarizes & presents the performance of any binary classification model Captures the model's ability to distinguish between false & true positives 2018 Bradley Malin 30

Beyond Precision Recall: ROC Originated from signal detection theory A binary signal corrupted by Gaussian noise What is the optimal threshold (i.e., operating point)? Depends on 3 factors: Signal Strength, Noise Variance, Personal tolerance for the Hit / False Alarm Rate 2018 Bradley Malin 31

Also Uses Multiple Contingency Tables Sample contingency tables across a range of thresholds/probabilities. TRUE POSITIVE RATE (also called SENSITIVITY) = True Positives / (True Positives + False Negatives) FALSE POSITIVE RATE (also called 1 - SPECIFICITY) = False Positives / (False Positives + True Negatives) Plot Sensitivity vs. (1 - Specificity) for the sampled thresholds and you are done 2018 Bradley Malin 32
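
The same kind of sweep yields ROC coordinates; a sketch (again my own, assuming that scores at or above the threshold are called positive):

```python
# Sketch: TPR (sensitivity) and FPR (1 - specificity) at each sampled threshold.
def roc_points(labels, scores, thresholds):
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for thr in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= thr)
        points.append((fp / n_neg, tp / n_pos))    # (FPR, TPR) = (1 - specificity, sensitivity)
    return points

# Plotting the (FPR, TPR) pairs for a grid of thresholds produces the ROC curve.
```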

Data-Centric Example
TRUTH  LOGISTIC  NEURAL
1      0.7198    0.9038
0      0.2460    0.8455
0      0.1219    0.4655
0      0.1560    0.3204
0      0.7527    0.2491
1      0.3064    0.7129
0      0.7194    0.4983
0      0.5531    0.6513
1      0.2173    0.3806
0      0.0839    0.1619
1      0.8429    0.7028
2018 Bradley Malin 33

ROC Rates
            LOGISTIC REGRESSION      NEURAL NETWORK
THRESHOLD   TP-Rate    FP-Rate       TP-Rate    FP-Rate
1           1          1             1          1
0.9         1          0.8571        1          1
0.8         1          0.5714        1          0.8571
0.7         0.75       0.4286        1          0.7143
0.6         0.5        0.4286        0.75       0.5714
0.5         0.5        0.4286        0.75       0.2857
0.4         0.5        0.2857        0.75       0.2857
0.3         0.5        0.2857        0.75       0.1429
0.2         0.25       0             0.25       0.1429
0.1         0          0             0.25       0
0           0          0             0          0
2018 Bradley Malin 34

ROC Point Plot [Sensitivity vs. 1-specificity points for model 1 (LOGISTIC) and model 2 (NEURAL), each axis from 0 to 1] 2018 Bradley Malin 35

Sidebar: Use More Samples [ROC curves (sensitivity vs. 1-specificity) for two models, RM and RM+Age41] (These are plots from a much larger dataset) 2018 Bradley Malin 36

ROC Quantification Area Under the ROC Curve Use quadrature to calculate the area e.g., the trapz (trapezoidal rule) function in Matlab will work Most environments have a function you can call (Python: sklearn.metrics roc_curve + auc, R: roc.area) AREA UNDER ROC CURVE: LOGISTIC 0.7321, NEURAL 0.7679 In this example, the Neural Network model appears to be better 2018 Bradley Malin 37
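
For instance, a sketch of the trapezoidal-rule calculation applied to the logistic-regression points in the ROC-rate table above (NumPy is an assumption; any quadrature routine works):

```python
import numpy as np

# Sketch: area under the logistic ROC points from the table above,
# listed here from FPR = 1 down to FPR = 0.
fpr = [1, 0.8571, 0.5714, 0.4286, 0.4286, 0.4286, 0.2857, 0.2857, 0, 0, 0]
tpr = [1, 1,      1,      0.75,   0.5,    0.5,    0.5,    0.5,    0.25, 0, 0]

auc = abs(np.trapz(tpr, fpr))   # abs() because the FPR values are in decreasing order
print(auc)                      # ~0.7321, the value reported on the slide
```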

Theory: Model Optimality Classifiers on the ROC convex hull are potentially optimal (each is the best choice for some cost/class distribution), e.g., the Neural Net & Decision Tree Classifiers below the convex hull are always suboptimal, e.g., Naïve Bayes 2018 Bradley Malin 38
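
A sketch (my construction, not from the lecture) of identifying which classifiers' ROC points lie on the convex hull, using a standard upper-hull computation in ROC space; the operating-point coordinates below are made up for illustration:

```python
# Sketch: keep only the ROC points on the upper convex hull between (0,0) and (1,1).
def cross(o, a, b):
    # z-component of the cross product (a - o) x (b - o); >= 0 means the turn is not clockwise
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def roc_convex_hull(points):
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()                       # drop points that fall below the upper envelope
        hull.append(p)
    return hull

# Hypothetical (FPR, TPR) operating points for three classifiers
classifiers = {"neural net": (0.2, 0.7), "decision tree": (0.4, 0.85), "naive Bayes": (0.35, 0.6)}
hull = set(roc_convex_hull(classifiers.values()))
print([name for name, pt in classifiers.items() if pt in hull])   # naive Bayes is left out
```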

Building Better Classifiers Classifiers on the convex hull (e.g., the Neural Net and Decision Tree) can be combined to form a strictly dominant hybrid classifier An ordered sequence of classifiers can be converted into a ranker 2018 Bradley Malin 39

Some Statistical Insight Curve Area: take a random non-PHI token from the records with score X and a random PHI token with score Y; the area is an estimate of P[Y > X] The slope of the curve equals the likelihood ratio P(score | Signal) / P(score | Noise) The ROC graph captures all the information in the contingency table The false negative & true negative rates are the complements of the true positive & false positive rates, respectively 2018 Bradley Malin 40
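
A sketch of that probabilistic reading: estimate P[Y > X] by comparing every PHI score with every non-PHI score (counting ties as 1/2, the usual Mann-Whitney convention; the tie handling is my assumption, not stated on the slide):

```python
# Sketch: AUC as P[score(random PHI) > score(random non-PHI)],
# estimated over all PHI / non-PHI pairs (ties count as 1/2).
def auc_probability(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels   = [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1]
logistic = [0.7198, 0.2460, 0.1219, 0.1560, 0.7527, 0.3064,
            0.7194, 0.5531, 0.2173, 0.0839, 0.8429]
print(auc_probability(labels, logistic))   # ~0.714 on these 11 rows
```

On the 11 sample rows this pairwise estimate (~0.71) differs a little from the coarse-threshold trapezoid reported on the slide, which is expected since the two are computed at different resolutions.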

Can Always Quantify the Best Operating Point When misclassification costs are equal, the best operating point is where a 45° line is tangent to the curve, closest to the (0,1) coordinate Verify this mathematically (economic interpretation) Why? [ROC plot with the candidate operating points marked] 2018 Bradley Malin 41
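
One simple reading of the equal-cost rule is to take the ROC point closest to the ideal corner (0,1); a sketch of that reading (the threshold values below are illustrative, not from the slide):

```python
# Sketch: pick the operating point whose (FPR, TPR) coordinate is closest to (0, 1).
def best_operating_point(roc_points):
    # roc_points: list of (fpr, tpr, threshold) tuples
    return min(roc_points, key=lambda p: p[0] ** 2 + (1 - p[1]) ** 2)

roc = [(0.0, 0.25, 0.8), (0.2857, 0.5, 0.6), (0.4286, 0.75, 0.45), (1.0, 1.0, 0.0)]  # illustrative
print(best_operating_point(roc))   # -> (0.4286, 0.75, 0.45) for these points
```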

Quick Question Are ROC curves always appropriate? Subjective operating points? One must weigh the tradeoffs between false positives and false negatives The ROC curve plot is independent of the class distribution and error costs This leads into utility theory (not touching this today) 2018 Bradley Malin 42

Much Much More than ROC You should also look up and learn about: Confidence intervals, Iso-accuracy lines, Skew distributions and why the 45° line isn't always best, Convexity vs. non-convexity vs. concavity, the Mann-Whitney-Wilcoxon sum of ranks, the Gini coefficient, Calibrated thresholds, Averaging ROC curves, Cost Curves 2018 Bradley Malin 43

Some References
Drummond C and Holte R. What ROC curves can and can't do (and cost curves can). In Proceedings of the Workshop on ROC Analysis in AI, in conjunction with the European Conference on AI. Valencia, Spain. 2004.
Lasko T, Bhagwat JG, Zou KH, Ohno-Machado L. The use of receiver operating characteristic curves in biomedical informatics. Journal of Biomedical Informatics. 2005; 38(5): 404-415.
McNeil BJ, Hanley JA. Statistical approaches to the analysis of receiver operating characteristic (ROC) curves. Medical Decision Making. 1984; 4: 137-50.
Provost F and Fawcett T. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the 15th International Conference on Machine Learning. Madison, Wisconsin. 1998: 445-453.
Swets J. Measuring the accuracy of diagnostic systems. Science. 1988; 240(4857): 1285-1293. (based on his 1967 book Information Retrieval Systems)
2018 Bradley Malin 44