Data Privacy in Biomedicine
Lecture 11b: Performance Measures for System Evaluation
Bradley Malin, PhD (b.malin@vanderbilt.edu)
Professor of Biomedical Informatics, Biostatistics, & Computer Science
Vanderbilt University
February 14, 2018
Classification

Imagine you have a dataset of tokens D. You believe they can be classified into two different classes:
- ID: tokens that correspond to PHI
- Non-ID: tokens that correspond to non-PHI
So Try to Classify

You have written several competing methods for classification. Now you want to know which one is better. How can you do this in a quantitative manner?
Notation and Models

Dataset of tokens: D = {d1, d2, d3, d4}
Notation and Models

Known underlying truth: T = {t1, ..., tn}
For the example dataset: t1 = PHI, t2 = not PHI, t3 = PHI, t4 = not PHI
Algorithm A Predictions

Predictions: A = {a1, ..., an}
a1 = PHI, a2 = not PHI, a3 = PHI, a4 = PHI
(Truth: t1 = PHI, t2 = not PHI, t3 = PHI, t4 = not PHI)
Algorithm A Predictions

Correct predictions (ai = ti): d1, d2, d3
Algorithm A Predictions

Incorrect predictions (ai != ti): d4 (predicted PHI, truth not PHI)
Algorithm B Predictions

Predictions: B = {b1, ..., bn}
b1 = not PHI, b2 = not PHI, b3 = PHI, b4 = PHI
(Truth: t1 = PHI, t2 = not PHI, t3 = PHI, t4 = not PHI)
Algorithm B Predictions

Correct predictions (bi = ti): d2, d3
Algorithm B Predictions

Incorrect predictions (bi != ti): d1 (predicted not PHI, truth PHI) and d4 (predicted PHI, truth not PHI)
Enter the Contingency Table

                               MODEL PREDICTED
                               "It's NOT PHI"   "It's PHI"
GOLD STANDARD   Was NOT PHI          A               B
TRUTH           Was PHI              C               D
Contingency Terms

                            MODEL PREDICTED
                            NO EVENT           EVENT
GOLD STANDARD   NO EVENT    TRUE NEGATIVE      B
TRUTH           EVENT       C                  TRUE POSITIVE
Some More Terms

                            MODEL PREDICTED
                            NO EVENT                         EVENT
GOLD STANDARD   NO EVENT    A                                FALSE POSITIVE (Type 1 error)
TRUTH           EVENT       FALSE NEGATIVE (Type 2 error)    D
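As a concrete sketch, the four cells can be tallied directly from paired truth/prediction labels. This is a minimal Python illustration using the Algorithm A example from earlier (the helper name `contingency` is ours, not from the lecture):

```python
# Minimal sketch: tally the four contingency cells from paired labels.
truth = ["PHI", "not PHI", "PHI", "not PHI"]   # t1..t4
pred_a = ["PHI", "not PHI", "PHI", "PHI"]      # a1..a4 (Algorithm A)

def contingency(truth, pred, positive="PHI"):
    """Return (tn, fp, fn, tp) counts for a binary classification."""
    tn = fp = fn = tp = 0
    for t, p in zip(truth, pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

print(contingency(truth, pred_a))  # (1, 1, 0, 2)
```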
Accuracy

What does this mean? What is the difference between accuracy and an accurate prediction?

Contingency table interpretation:

Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)

Is this a good measure? (Why or why not?)
Algorithm Comparison

Algorithm A:
                     PREDICTED
                     NOT PERSON   PERSON
TRUTH   NOT PERSON       1           1
        PERSON           0           2

Algorithm B:
                     PREDICTED
                     NOT PERSON   PERSON
TRUTH   NOT PERSON       1           1
        PERSON           1           1

Accuracy of Algorithm A: 3 / 4 = 0.75
Accuracy of Algorithm B: 2 / 4 = 0.5
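A quick sketch of the same comparison in code, plugging the cell counts above into the accuracy formula:

```python
# Accuracy = (TP + TN) / (TP + TN + FP + FN), using the cell counts above.
def accuracy(tn, fp, fn, tp):
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tn=1, fp=1, fn=0, tp=2))  # Algorithm A: 0.75
print(accuracy(tn=1, fp=1, fn=1, tp=1))  # Algorithm B: 0.5
```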
Note on Discrete Classes

TRADITIONALLY: show the contingency table when reporting a model's predictions.

BUT probabilistic models do not provide discrete counts for the matrix cells!

IN OTHER WORDS: the algorithm does not necessarily report which documents were correctly predicted. INSTEAD, it reports the probability that the output takes a certain value (e.g., "PHI" or "not PHI").
What to Do if Classification is Probabilistic?

Imagine you have 2 different probabilistic classification models, e.g., Algorithm A vs. Algorithm B.
- How do you know which one is better?
- How do you communicate your belief?
- Can you provide quantitative evidence beyond a gut feeling and subjective interpretation?
Which Score Should Be The Threshold?

[Figure: score histograms (frequency vs. score) for the "NOT PHI" (NOT Person) and "PHI" (Person) classes, with several candidate threshold positions marked "?".]
Consider Precision-Recall

First, order your documents by score for the positive class, e.g., "PHI" scores from Algorithm A (the higher the score, the higher the confidence):

Document   Truth      PHI score
d2         not PHI    0.2
d1         PHI        0.5
d3         PHI        0.7
d4         not PHI    0.9
Recall

Now, choose a threshold score and make the classification.
Example: Threshold = 0.4. Documents scoring below the threshold (d2) are classified as "NOT PHI"; documents scoring above it (d1, d3, d4) are classified as "PHI".
Recall

Recall is the fraction of the documents you wanted (the true "PHI" documents) that you classified as "PHI".
With the threshold at 0.4, both PHI documents (d1 and d3) score above it, so Recall = 2/2 = 1.0.
Precision

Precision is the fraction of the documents you recalled (classified as "PHI") that are truly labeled "PHI".
With the threshold at 0.4, three documents (d1, d3, d4) are classified as PHI and two of them are truly PHI, so Precision = 2/3, approximately 0.67.
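A minimal sketch of this worked example in Python (the variable names are ours):

```python
# Worked example: precision and recall at a single threshold.
docs = [("d2", "not PHI", 0.2), ("d1", "PHI", 0.5),
        ("d3", "PHI", 0.7), ("d4", "not PHI", 0.9)]
threshold = 0.4

predicted_phi = [(name, truth) for name, truth, score in docs if score > threshold]
true_phi = [name for name, truth, _ in docs if truth == "PHI"]

tp = sum(1 for _, truth in predicted_phi if truth == "PHI")
recall = tp / len(true_phi)          # 2 / 2 = 1.0
precision = tp / len(predicted_phi)  # 2 / 3 ~ 0.67
print(recall, precision)
```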
Precision & Recall a la Venn

Set of documents D:
- the set of relevant documents in D (i.e., the "PHI" class)
- the set of documents classified as relevant (i.e., "PHI")
Precision & Recall a la Venn

Let X be the relevant documents not classified as relevant, Y the documents classified as relevant that are not, and Z their overlap (relevant documents classified as relevant):

RECALL = Z / (X + Z)
PRECISION = Z / (Z + Y)
Precision-Recall Curve

- The previous example showed recall and precision for a single threshold.
- Now calculate the scores at thresholds across the range of scores.
- Plot the resulting scores as <recall, precision> coordinate points, usually in the range [0,1].
- Standard: the 11-point curve, i.e., 11 points plotted.

A sketch of this threshold sweep in code follows below.
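Here is that sketch, assuming scikit-learn is available; it reuses the four-document example, which (as the next slide notes) is really too small for a meaningful curve:

```python
# Sketch: sweep thresholds to build a P-R curve (assumes scikit-learn).
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 1, 1, 0])          # d2, d1, d3, d4 (1 = PHI)
scores = np.array([0.2, 0.5, 0.7, 0.9])  # PHI scores from Algorithm A

precision, recall, thresholds = precision_recall_curve(y_true, scores)
for r, p in zip(recall, precision):
    print(f"recall={r:.2f}  precision={p:.2f}")
```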
P-R Curve Example

- 4 documents is not enough for a P-R curve.
- Imagine you had 200 documents (100 "PHI" and 100 "NOT PHI").

[Figure: P-R curve for Algorithm A (x-axis: recall, y-axis: precision, both 0 to 1).]
P-R Curve Example

- To compare algorithms, consider plotting both P-R curves in the same graph.
- Use critical points or thresholds to determine which algorithm is better in particular scenarios.
- From a general perspective, the area under the curve (AUC) provides a measure of how good a classification method is.
Comparative Performance

[Figure: P-R curves for Algorithm A and Algorithm B plotted on the same axes (recall vs. precision, both 0 to 1).]
ROC Curves

- Receiver operating characteristic
- Summarizes & presents the performance of any binary classification model
- Captures the model's ability to trade off between false & true positives
Beyond Precision-Recall: ROC

- Originated from signal detection theory: a binary signal corrupted by Gaussian noise. What is the optimal threshold (i.e., operating point)?
- Depends on 3 factors:
  - Signal strength
  - Noise variance
  - Personal tolerance for the hit / false alarm rate
Also Uses Multiple Contingency Tables

Sample contingency tables from a range of thresholds/probabilities.

TRUE POSITIVE RATE (also called SENSITIVITY):
  TPR = True Positives / (True Positives + False Negatives)

FALSE POSITIVE RATE (also called 1 - SPECIFICITY):
  FPR = False Positives / (False Positives + True Negatives)

Plot sensitivity vs. (1 - specificity) for the sampled thresholds and you are done. A sketch in code follows below.
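A minimal sketch of the sweep, assuming scikit-learn (its `roc_curve` returns exactly these two rates), applied to the logistic scores from the data-centric example on the next slide. Note that `roc_curve` places thresholds at the observed scores, so its points may differ slightly from the fixed 0.1-step grid used two slides ahead:

```python
# Sketch: compute (FPR, TPR) pairs across thresholds (assumes scikit-learn).
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1])   # truth column, next slide
scores = np.array([0.7198, 0.2460, 0.1219, 0.1560, 0.7527,
                   0.3064, 0.7194, 0.5531, 0.2173, 0.0839, 0.8429])  # logistic

fpr, tpr, thresholds = roc_curve(y_true, scores)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.4f}  TPR={t:.2f}  FPR={f:.4f}")
```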
Data-Centric Example

TRUTH   LOGISTIC   NEURAL
1       0.7198     0.9038
0       0.2460     0.8455
0       0.1219     0.4655
0       0.1560     0.3204
0       0.7527     0.2491
1       0.3064     0.7129
0       0.7194     0.4983
0       0.5531     0.6513
1       0.2173     0.3806
0       0.0839     0.1619
1       0.8429     0.7028
ROC Rates

             LOGISTIC REGRESSION      NEURAL NETWORK
THRESHOLD    TP-Rate    FP-Rate       TP-Rate    FP-Rate
1.0          1          1             1          1
0.9          1          0.8571        1          1
0.8          1          0.5714        1          0.8571
0.7          0.75       0.4286        1          0.7143
0.6          0.5        0.4286        0.75       0.5714
0.5          0.5        0.4286        0.75       0.2857
0.4          0.5        0.2857        0.75       0.2857
0.3          0.5        0.2857        0.75       0.1429
0.2          0.25       0             0.25       0.1429
0.1          0          0             0.25       0
0.0          0          0             0          0
ROC Point Plot

[Figure: ROC points (sensitivity vs. 1-specificity) for model 1 (LOGISTIC) and model 2 (NEURAL).]
Sidebar: Use More Samples

[Figure: two ROC plots (sensitivity vs. 1-specificity) comparing models RM, RM+Age41, and a deviance model. These are plots from a much larger dataset.]
ROC Quantification

Area Under the ROC Curve:
- Use quadrature to calculate the area, e.g., the trapz (trapezoidal rule) function in Matlab will work.
- Most programs have a function you can call (Python: roc_curve, R: roc.area).

AREA UNDER ROC CURVE
LOGISTIC   0.7321
NEURAL     0.7679

In this example, the neural network model appears to be better.
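As a check, a minimal dependency-free Python sketch of the trapezoidal rule, applied to the (FPR, TPR) points from the ROC Rates table, reproduces both areas:

```python
# Trapezoidal-rule AUC from the (FPR, TPR) points in the ROC Rates table,
# listed from threshold 1.0 down to 0.0.
fpr_log = [1, 0.8571, 0.5714, 0.4286, 0.4286, 0.4286, 0.2857, 0.2857, 0, 0, 0]
tpr_log = [1, 1, 1, 0.75, 0.5, 0.5, 0.5, 0.5, 0.25, 0, 0]
fpr_net = [1, 1, 0.8571, 0.7143, 0.5714, 0.2857, 0.2857, 0.1429, 0.1429, 0, 0]
tpr_net = [1, 1, 1, 1, 0.75, 0.75, 0.75, 0.75, 0.25, 0.25, 0]

def trapezoid_auc(fpr, tpr):
    """Sum of trapezoid areas; points must be ordered monotonically in FPR."""
    area = 0.0
    for i in range(len(fpr) - 1):
        area += abs(fpr[i] - fpr[i + 1]) * (tpr[i] + tpr[i + 1]) / 2
    return area

print(f"{trapezoid_auc(fpr_log, tpr_log):.4f}")  # 0.7321 (logistic)
print(f"{trapezoid_auc(fpr_net, tpr_net):.4f}")  # 0.7679 (neural)
```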
Theory: Model Optimality

- Classifiers on the convex hull are always optimal, e.g., the neural net & decision tree.
- Classifiers below the convex hull are always suboptimal, e.g., naive Bayes.

[Figure: ROC points for Neural Net, Decision Tree, and Naive Bayes, with the convex hull passing through the first two.]
Building Better Classifiers

- Classifiers on the convex hull can be combined to form a strictly dominant hybrid classifier.
- An ordered sequence of classifiers can be converted into a ranker.

[Figure: convex hull through the Neural Net and Decision Tree operating points.]
Some Statistical Insight

Curve area:
- Take a random non-PHI token from the records: score X.
- Take a random PHI token from the records: score Y.
- The area is an estimate of P[Y > X].

The slope of the curve is equal to the likelihood ratio:
  P(score | Signal) / P(score | Noise)

The ROC graph captures all of the information in the contingency table: the false negative & true negative rates are the complements of the true positive & false positive rates, respectively.
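A small sketch of the P[Y > X] interpretation, estimated by brute force over all (positive, negative) score pairs from the logistic column of the data-centric example. (The exact pairwise value, about 0.71, differs somewhat from the 0.7321 above because that number came from the coarse 11-threshold grid.)

```python
# Estimate AUC as P[Y > X]: the probability a random positive outscores
# a random negative (logistic scores from the data-centric example).
pos = [0.7198, 0.3064, 0.2173, 0.8429]                          # truth = 1
neg = [0.2460, 0.1219, 0.1560, 0.7527, 0.7194, 0.5531, 0.0839]  # truth = 0

wins = sum(y > x for y in pos for x in neg)
print(wins / (len(pos) * len(neg)))  # 20/28 ~ 0.7143
```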
Can Always Quantify the Best Operating Point

When misclassification costs are equal, the best operating point is where a 45-degree line is tangent to the curve: the point closest to the (0,1) coordinate.

Verify this mathematically (economic interpretation). Why?

[Figure: ROC curves (RM, RM+Age41) with the 45-degree tangent line and the operating point nearest (0,1) highlighted.]
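A minimal sketch of picking that point under equal misclassification costs: take the (FPR, TPR) pair with the smallest Euclidean distance to the ideal corner (0, 1). The points below are the distinct logistic pairs from the ROC Rates table:

```python
# Pick the operating point closest to the ideal corner (FPR=0, TPR=1),
# appropriate when false positives and false negatives cost the same.
import math

points = [(1, 1), (0.8571, 1), (0.5714, 1), (0.4286, 0.75), (0.4286, 0.5),
          (0.2857, 0.5), (0, 0.25), (0, 0)]  # logistic (FPR, TPR) pairs

best = min(points, key=lambda p: math.hypot(p[0] - 0, p[1] - 1))
print(best)  # (0.4286, 0.75) for these points
```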
Quick Question

Are ROC curves always appropriate?
- Subjective operating points? You must weigh the tradeoffs between false positives and false negatives.
- The ROC curve plot is independent of the class distribution and error costs.
- This leads into utility theory (not touching this today).
Much, Much More than ROC

You should also look up and learn about:
- Confidence intervals
- Iso-accuracy lines
- Skew distributions and why the 45-degree line isn't always best
- Convexity vs. non-convexity vs. concavity
- Mann-Whitney-Wilcoxon sum of ranks
- Gini coefficient
- Calibrated thresholds
- Averaging ROC curves
- Cost curves
Some References

Drummond C, Holte R. What ROC curves can and can't do (and cost curves can). In: Proceedings of the Workshop on ROC Analysis in AI, held in conjunction with the European Conference on AI. Valencia, Spain. 2004.

Lasko T, Bhagwat JG, Zou KH, Ohno-Machado L. The use of receiver operating characteristic curves in biomedical informatics. Journal of Biomedical Informatics. 2005; 38(5): 404-415.

McNeil BJ, Hanley JA. Statistical approaches to the analysis of receiver operating characteristic (ROC) curves. Medical Decision Making. 1984; 4: 137-150.

Provost F, Fawcett T. The case against accuracy estimation for comparing induction algorithms. In: Proceedings of the 15th International Conference on Machine Learning. Madison, Wisconsin. 1998: 445-453.

Swets J. Measuring the accuracy of diagnostic systems. Science. 1988; 240(4857): 1285-1293. (Based on his 1967 book Information Retrieval Systems.)