Data Privacy in Biomedicine. Lecture 11b: Performance Measures for System Evaluation


1 Data Privacy in Biomedicine Lecture 11b: Performance Measures for System Evaluation Bradley Malin, PhD Professor of Biomedical Informatics, Biostatistics, & Computer Science Vanderbilt University February 14, 2018

2 Classification
Imagine you have a dataset of tokens D.
You believe they can be classified into two different classes:
ID: Tokens that correspond to PHI
Non-ID: Tokens that correspond to non-PHI
2018 Bradley Malin 2

3 So Try to Classify
You have written several competing methods for classification.
Now you want to know which one is better.
How can you do this in a quantitative manner?

4 Notation and Models
Dataset of Tokens D: d1, d2, d3, d4

5 Notation and Models
Known Underlying Truth T = {t1, ..., tn}
d1: t1 = PHI
d2: t2 = not PHI
d3: t3 = PHI
d4: t4 = not PHI

6 Algorithm A Predictions
Predictions A = {a1, ..., an}
d1: t1 = PHI, a1 = PHI
d2: t2 = not PHI, a2 = not PHI
d3: t3 = PHI, a3 = PHI
d4: t4 = not PHI, a4 = PHI

7 Algorithm A Predictions: Correct Predictions
d1: t1 = PHI, a1 = PHI
d2: t2 = not PHI, a2 = not PHI
d3: t3 = PHI, a3 = PHI

8 Algorithm A Predictions: Incorrect Predictions
d4: t4 = not PHI, a4 = PHI

9 Algorithm B Predictions
Predictions B = {b1, ..., bn}
d1: t1 = PHI, b1 = not PHI
d2: t2 = not PHI, b2 = not PHI
d3: t3 = PHI, b3 = PHI
d4: t4 = not PHI, b4 = PHI

10 Algorithm B Predictions: Correct Predictions
d2: t2 = not PHI, b2 = not PHI
d3: t3 = PHI, b3 = PHI

11 Algorithm B Predictions: Incorrect Predictions
d1: t1 = PHI, b1 = not PHI
d4: t4 = not PHI, b4 = PHI

12 Enter the Contingency Table
                                   MODEL PREDICTED
GOLD STANDARD TRUTH     It's NOT PHI     It's PHI
Was NOT PHI             A                B
Was PHI                 C                D

13 Contingency Terms
                                   MODEL PREDICTED
GOLD STANDARD TRUTH     NO EVENT         EVENT
NO EVENT                TRUE NEGATIVE    B
EVENT                   C                TRUE POSITIVE

14 Some More Terms
                                   MODEL PREDICTED
GOLD STANDARD TRUTH     NO EVENT                          EVENT
NO EVENT                A                                 FALSE POSITIVE (Type 1 Error)
EVENT                   FALSE NEGATIVE (Type 2 Error)     D

15 Accuracy
What does this mean?
What is the difference between accuracy and an accurate prediction?
Contingency Table Interpretation:
Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)
Is this a good measure? (Why or why not?)

16 Algorithm Comparison
Algorithm A (rows: truth, columns: predicted)
                     NOT PERSON    PERSON
TRUTH NOT PERSON     1             1
TRUTH PERSON         0             2
Algorithm B (rows: truth, columns: predicted)
                     NOT PERSON    PERSON
TRUTH NOT PERSON     1             1
TRUTH PERSON         1             1
Accuracy Algorithm A: 3 / 4 = 0.75
Accuracy Algorithm B: 2 / 4 = 0.50
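
The counts and accuracies above can be reproduced from the truth and prediction vectors on the earlier slides. The following is a minimal Python sketch; the function names `contingency` and `accuracy` are illustrative choices, not part of the lecture.

```python
# Truth and predictions for tokens d1..d4, as listed on the earlier slides.
truth = ["PHI", "not PHI", "PHI", "not PHI"]
pred_a = ["PHI", "not PHI", "PHI", "PHI"]
pred_b = ["not PHI", "not PHI", "PHI", "PHI"]


def contingency(truth, pred, positive="PHI"):
    """Return the (TN, FP, FN, TP) counts for a binary labeling."""
    tn = fp = fn = tp = 0
    for t, p in zip(truth, pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1  # truth is PHI, model said not PHI
        elif p == positive:
            fp += 1  # truth is not PHI, model said PHI
        else:
            tn += 1
    return tn, fp, fn, tp


def accuracy(truth, pred):
    """(TP + TN) divided by the total number of predictions."""
    tn, fp, fn, tp = contingency(truth, pred)
    return (tp + tn) / (tp + tn + fp + fn)


print(contingency(truth, pred_a), accuracy(truth, pred_a))  # (1, 1, 0, 2) 0.75
print(contingency(truth, pred_b), accuracy(truth, pred_b))  # (1, 1, 1, 1) 0.5
```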

17 Note on Discrete Classes
TRADITIONALLY: Show a contingency table when reporting the predictions of a model.
BUT: Probabilistic models do not provide discrete counts for the matrix cells!
IN OTHER WORDS: An algorithm does not necessarily report the number of documents that were correctly predicted.
INSTEAD: It reports the probability that the output takes a certain value (e.g., PHI or not PHI).

18 What to Do if Classification is Probabilistic?
Imagine you have 2 different probabilistic classification models, e.g., Algorithm A vs. Algorithm B.
How do you know which one is better?
How do you communicate your belief?
Can you provide quantitative evidence beyond a gut feeling and subjective interpretation?

19 Which Score Should Be The Threshold?
[Figure: frequency of scores for the NOT PHI (NOT Person) and PHI (Person) classes; x-axis: Score, y-axis: Frequency; the threshold position is marked "??????"]

20 Consider Precision-Recall
First, order your documents by score for the positive class.
E.g., PHI scores from Algorithm A (the higher the score, the higher the confidence)
Order by PHI score: d2, d1, d3, d4
d2: t2 = not PHI, PHI score 0.2
d1: t1 = PHI
d3: t3 = PHI
d4: t4 = not PHI

21 Recall
Now, choose a threshold score and make a classification.
Ex: Threshold = 0.4
Below the threshold (classify all as NOT PHI): d2 (t2 = not PHI, PHI score 0.2)
Above the threshold (classify all as PHI): d1 (t1 = PHI), d3 (t3 = PHI), d4 (t4 = not PHI)

22 Recall
Recall is the fraction of the documents you wanted (truly PHI) that you classified as PHI.
With Threshold at 0.4, Recall = 2 / 2 = 1.0
Below the threshold (classify all as NOT PHI): d2 (t2 = not PHI, PHI score 0.2)
Above the threshold (classify all as PHI): d1 (t1 = PHI), d3 (t3 = PHI), d4 (t4 = not PHI)

23 Precision
Precision is the fraction of the documents you classified as PHI that were labeled correctly, i.e., are truly PHI.
With Threshold at 0.4, Precision = 2 / 3 = 0.67
Below the threshold (classify all as NOT PHI): d2 (t2 = not PHI, PHI score 0.2)
Above the threshold (classify all as PHI): d1 (t1 = PHI), d3 (t3 = PHI), d4 (t4 = not PHI)
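
A short Python sketch of the threshold step may help here. Only the score 0.2 for d2 comes from the slides; the scores for d1, d3, and d4 are hypothetical values chosen to preserve the stated ordering (d2, d1, d3, d4), and the helper name `precision_recall` is an illustrative assumption.

```python
# PHI scores per token. Only d2's score (0.2) is from the slides;
# the other three values are hypothetical placeholders.
scores = {"d2": 0.2, "d1": 0.5, "d3": 0.7, "d4": 0.9}
is_phi = {"d1": True, "d2": False, "d3": True, "d4": False}


def precision_recall(scores, is_phi, threshold):
    """Classify tokens scoring at or above the threshold as PHI."""
    predicted = {d for d, s in scores.items() if s >= threshold}
    relevant = {d for d, flag in is_phi.items() if flag}
    true_pos = len(predicted & relevant)
    # Convention assumed here: precision is 1.0 when nothing is predicted PHI.
    precision = true_pos / len(predicted) if predicted else 1.0
    recall = true_pos / len(relevant)
    return precision, recall


p, r = precision_recall(scores, is_phi, threshold=0.4)
print(round(p, 2), r)  # 0.67 1.0
```

With these placeholder scores, the result matches the precision and recall stated on the slides for a threshold of 0.4.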

24 Precision Recall à la Venn
Set of Documents D
The Set of Relevant Documents in D (i.e., PHI class)
The Set of Documents classified as Relevant (i.e., PHI)

25 Precision Recall à la Venn
X: Relevant documents (i.e., PHI class) that were not classified as Relevant
Z: Relevant documents that were classified as Relevant (the overlap of the two sets)
Y: Documents classified as Relevant (i.e., PHI) that are not Relevant
RECALL = Z / (X + Z)
PRECISION = Z / (Z + Y)

26 Precision Recall Curve
The previous example showed Recall and Precision for a single threshold.
Now calculate both values at thresholds across the range of scores.
Plot the resulting values as <recall, precision> coordinate points.
Usually in range [0,1]
Standard 11-point curve, i.e., 11 points plotted
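
The threshold sweep can be sketched in a few lines of Python. This is an illustration under assumptions: the function name `pr_points` is hypothetical, and of the scores below only 0.2 (for d2) appears on the slides. It emits one point per distinct score rather than the standard 11 interpolated points.

```python
def pr_points(scores, labels):
    """Return (recall, precision) pairs, one per distinct score threshold.

    labels holds 1 for the positive class (PHI) and 0 otherwise.
    Thresholds are visited from the highest score to the lowest.
    """
    total_pos = sum(labels)
    points = []
    for thr in sorted(set(scores), reverse=True):
        predicted = [s >= thr for s in scores]
        true_pos = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        # sum(predicted) >= 1 because thr is itself one of the scores.
        points.append((true_pos / total_pos, true_pos / sum(predicted)))
    return points


# Tokens d1..d4; only the 0.2 score is from the slides, the rest are hypothetical.
scores = [0.5, 0.2, 0.7, 0.9]
labels = [1, 0, 1, 0]
for recall, precision in pr_points(scores, labels):
    print(f"recall={recall:.2f} precision={precision:.2f}")
```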

27 P-R Curve Example
4 documents are not enough for a P-R curve.
Imagine you had 200 documents (100 PHI and 100 NOT PHI).
This graph is the P-R curve for Algorithm A.
[Figure: P-R curve for Algorithm A; x-axis: Recall, y-axis: Precision]

28 P-R Curve Example
To compare algorithms, consider plotting both P-R curves in the same graph.
Use critical points or thresholds to determine which algorithm is better in particular scenarios.
From a general perspective, the area under the curve, or AUC, provides a measure of how good a classification method is.

29 Comparative Performance
[Figure: P-R curves for Algorithm A and Algorithm B; x-axis: Recall, y-axis: Precision]

30 ROC Curves
Receiver operating characteristic
Summarize & present the performance of any binary classification model
Model's ability to distinguish between false & true positives

31 Beyond Precision Recall: ROC
Originated from signal detection theory
Binary signal corrupted by Gaussian noise
What is the optimal threshold (i.e., operating point)?
Dependence on 3 factors:
Signal Strength
Noise Variance
Personal tolerance in Hit / False Alarm Rate

32 Also Uses Multiple Contingency Tables
Sample contingency tables from a range of thresholds/probabilities.
TRUE POSITIVE RATE (also called SENSITIVITY) = True Positives / (True Positives + False Negatives)
FALSE POSITIVE RATE (also called 1 - SPECIFICITY) = False Positives / (False Positives + True Negatives)
Plot Sensitivity vs. (1 - Specificity) for the sampling and you are done.
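
As a rough Python sketch of this sampling, the function below builds one contingency table per distinct score threshold and records the resulting (false positive rate, true positive rate) pair. The name `roc_points` and the example scores are assumptions for illustration; they are not the lecture's data.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, starting at (0, 0) and ending at (1, 1).

    labels holds 1 for the positive class and 0 for the negative class.
    Each distinct score is used once as a threshold (score >= threshold
    means "predict positive"), from the strictest to the most lenient.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores for four tokens (two positive, two negative).
scores = [0.5, 0.2, 0.7, 0.9]
labels = [1, 0, 1, 0]
print(roc_points(scores, labels))
# [(0.0, 0.0), (0.5, 0.0), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```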

33 Data-Centric Example
[Table: columns TRUTH, LOGISTIC, NEURAL]

34 ROC Rates
[Table: columns THRESHOLD; LOGISTIC REGRESSION TP-Rate and FP-Rate; NEURAL NETWORK TP-Rate and FP-Rate]

35 ROC Point Plot
[Figure: ROC points for model1 (LOGISTIC) and model2 (NEURAL); x-axis: 1 - specificity, y-axis: sensitivity]

36 Sidebar: Use More Samples
[Figure: ROC plots of sensitivity vs. 1 - specificity for the models RM and RM+Age41]
(These are plots from a much larger dataset)

37 ROC Quantification
Area Under ROC Curve
Use quadrature to calculate the area, e.g., the trapz (trapezoidal rule) function in MATLAB will work.
Most programs have a function you can call (Python: roc_curve together with auc, or roc_auc_score, in scikit-learn; R: roc.area).
[Table: AREA UNDER ROC CURVE for LOGISTIC and NEURAL]
In this example, it appears the Neural Network model is better.
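
For readers without MATLAB at hand, the trapezoidal rule is easy to write out directly. The sketch below is a plain-Python illustration; the function name `trapezoid_auc` and the ROC points are hypothetical and are not the logistic or neural network results from the lecture.

```python
def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points.

    For an ROC curve, x is the false positive rate and y is the
    true positive rate. Points are sorted by x before integrating.
    """
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Each segment contributes width times average height.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical ROC points, including the (0, 0) and (1, 1) endpoints.
roc = [(0.0, 0.0), (0.25, 0.5), (0.5, 0.75), (1.0, 1.0)]
print(trapezoid_auc(roc))  # 0.65625
```

A curve along the diagonal from (0, 0) to (1, 1) gives 0.5, the value expected from random guessing.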

38 Theory: Model Optimality
Classifiers on the convex hull are always optimal, e.g., Net & Tree
Classifiers below the convex hull are always suboptimal, e.g., Naïve Bayes
[Figure: ROC points for Neural Net, Decision Tree, and Naïve Bayes, with the convex hull]
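
One way to find which classifiers lie on the convex hull is a monotone-chain scan over the ROC points. The Python sketch below is an illustration only: the function names are hypothetical, and the (FPR, TPR) coordinates for the three classifiers are made-up values, since the original figure's coordinates are not available.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Upper convex hull of ROC points, always including (0, 0) and (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or below the
        # straight line from the point before it to the new point p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


# Hypothetical (FPR, TPR) operating points for three classifiers.
classifiers = {
    "Neural Net": (0.1, 0.6),
    "Naive Bayes": (0.3, 0.5),
    "Decision Tree": (0.4, 0.9),
}
hull = roc_convex_hull(classifiers.values())
on_hull = [name for name, pt in classifiers.items() if pt in hull]
print(hull)     # [(0.0, 0.0), (0.1, 0.6), (0.4, 0.9), (1.0, 1.0)]
print(on_hull)  # ['Neural Net', 'Decision Tree']
```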

39 Building Better Classifiers
Classifiers on the convex hull can be combined to form a strictly dominant hybrid classifier.
An ordered sequence of classifiers can be converted into a ranker.
[Figure: ROC points for Neural Net and Decision Tree]

40 Some Statistical Insight
Curve Area:
Take a random non-PHI record, with score X
Take a random PHI record, with score Y
Area is an estimate of P[Y > X]
Slope of the curve is equal to the likelihood ratio: P(score | Signal) / P(score | Noise)
ROC graph captures all information in the contingency table
False negative & true negative rates are complements of true positive & false positive rates, resp.
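
The P[Y > X] reading of the area can be checked by brute force over all (PHI, non-PHI) pairs. The Python sketch below assumes a small set of hypothetical scores and an illustrative function name, `auc_by_ranking`; counting ties as one half is a common convention assumed here, not something stated in the lecture.

```python
def auc_by_ranking(scores, labels):
    """Estimate P[Y > X]: Y is a positive's score, X is a negative's score."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = 0.0
    for y in pos:
        for x in neg:
            if y > x:
                wins += 1.0
            elif y == x:
                wins += 0.5  # assumed convention: a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical scores: two PHI tokens (label 1) and two non-PHI tokens (label 0).
scores = [0.5, 0.2, 0.7, 0.9]
labels = [1, 0, 1, 0]
print(auc_by_ranking(scores, labels))  # 0.5
```

For these scores, the pairwise estimate equals the trapezoidal area under the ROC curve built from the same data.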

41 Can Always Quantify Best Operating Point
When misclassification costs are equal, the best operating point is where the 45° tangent touches the curve, closest to the (0,1) coordinate.
Verify this mathematically (economic interpretation)
Why?
[Figure: ROC plots for RM and RM_AGE41 with a linear reference line; axes: Sensitivity and Specificity]
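
Two common ways to pick an operating point from sampled ROC points can be sketched in Python. The ROC points below are hypothetical. The slope-1 criterion shown assumes equal misclassification costs and, additionally, equal class sizes, which is an assumption added here for the illustration.

```python
import math

# Hypothetical (FPR, TPR) points sampled along an ROC curve.
roc = [(0.0, 0.0), (0.1, 0.6), (0.2, 0.8), (0.5, 0.9), (1.0, 1.0)]

# A 45-degree line TPR = FPR + c touches the curve at the point with the
# largest intercept c, i.e., the point that maximizes TPR - FPR.
best_tangent = max(roc, key=lambda p: p[1] - p[0])

# Alternative criterion: the point nearest the perfect classifier at (0, 1).
best_distance = min(roc, key=lambda p: math.hypot(p[0], 1.0 - p[1]))

print(best_tangent)   # (0.2, 0.8)
print(best_distance)  # (0.2, 0.8)
```

The two criteria agree for these points, but they can select different points on other curves.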

42 Quick Question
Are ROC curves always appropriate?
Subjective operating points?
Must weigh the tradeoffs between false positives and false negatives
ROC curve plot is independent of the class distribution or error costs
This leads into utility theory (not touching this today)

43 Much Much More than ROC
You should also look up and learn about:
Confidence intervals
Iso-accuracy lines
Skew distributions and why the 45° line isn't always best
Convexity vs. non-convexity vs. concavity
Mann-Whitney-Wilcoxon sum of ranks
Gini coefficient
Calibrated thresholds
Averaging ROC curves
Cost Curves

44 Some References
Drummond C and Holte R. What ROC curves can and can't do (and cost curves can). In Proceedings of the Workshop on ROC Analysis in AI; in conjunction with the European Conference on AI. Valencia, Spain.
Lasko T, Bhagwat JG, Zou KH, Ohno-Machado L. The use of receiver operating characteristic curves in biomedical informatics. Journal of Biomedical Informatics. 2005; 38(5).
McNeil BJ, Hanley JA. Statistical approaches to the analysis of receiver operating characteristic (ROC) curves. Medical Decision Making. 1984; 4.
Provost F and Fawcett T. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the 15th International Conference on Machine Learning. Madison, Wisconsin. 1998.
Swets J. Measuring the accuracy of diagnostic systems. Science. 1988; 240(4857). (based on his 1967 book Information Retrieval Systems)
