Model Accuracy Measures


Model Accuracy Measures. Master in Bioinformatics, UPF 2017-2018. Eduardo Eyras, Computational Genomics, Pompeu Fabra University - ICREA, Barcelona, Spain

Variables: what we can measure (attributes). Hypotheses: what we want to predict (class values/labels). Examples: the training set (labeled data). The model is trained on the labeled examples and then used to predict on new cases. Prediction: does this example belong to this model? Classification: what is the most probable label?

Testing the accuracy of a model. Is my method good enough (for the specific problem)? How does my method compare to other methods?

Testing the accuracy of a model. We need a systematic way to evaluate and compare multiple methods. Methods are heterogeneous in their purposes, e.g.: 1) ability to classify instances accurately; 2) predicting/scoring the class labels; 3) methods may predict numerical or nominal values (score, class label, yes/no, posterior probability, etc.). Thus we need a methodology that is applicable to all of them.

Training and Testing. Accuracy: the expected performance of the model on future (new) data. It is wrong to estimate the accuracy on the same dataset used to build (train) the model: this estimation would be overly optimistic. Overfitting → the model won't necessarily generalize well to new, different instances.

Training and Testing. Separate the known (labeled) cases into a training set and a test set. The model is built in the training step and assessed in the evaluation step: on the test cases we predict and compare the predictions with the known labels. How to do the splitting? A common choice is 2/3 for training and 1/3 for testing. This approach is suitable when the entire dataset is large.

Training and Testing. How to select the data for training and testing: 1) Stratification: the relative size of each prediction class should be similar in the training and test subsets (balanced subsets). 2) Homogeneity: the data sets should have similar properties to give a reliable test, e.g. GC content, peptide lengths, species represented (would you test a model of human transmembrane domains with yeast proteins?). These conditions ensure that the different properties and prediction classes are well represented. Provided that the sets are balanced and homogeneous, the accuracy on the test set will be a good estimate of future performance.

Training and Testing: N-fold cross validation. The data set is split into N parts; in each fold, 1/N of the data is used as the test set and the remaining (N-1)/N as the training set. A predictive model is built on the training part and its accuracy is measured on the test part (accuracy is used generically here: any measure of prediction performance), giving Accuracy 1, Accuracy 2, ..., Accuracy N. The average accuracy over the N folds reflects the performance of the model on the entire dataset. Important: the subsets must be representative of the original data (stratification and homogeneity). The standard is to do 10-fold cross validation.
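As a concrete illustration, here is a minimal sketch of stratified 10-fold cross validation. It assumes scikit-learn is available; the synthetic data and the logistic-regression classifier are placeholders, not the models discussed in this course.

```python
# Minimal sketch of stratified 10-fold cross validation (assumes scikit-learn).
# The data and the classifier are placeholders for illustration only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracies = []
for train_idx, test_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    # "accuracy" is used generically; here it is the fraction of correct predictions
    accuracies.append(model.score(X[test_idx], y[test_idx]))

print("mean accuracy over 10 folds:", np.mean(accuracies))
```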

Training and Testing: leave-one-out. It is like N-fold cross validation, but where N is the size of the data set (the number of instances): train on all instances but one, and test on that one. Advantages: 1) the greatest possible amount of data is used for training (N-1 instances); 2) it is deterministic: no random sampling of subsets is involved. Disadvantages: 1) it is computationally more expensive; 2) it cannot be stratified. E.g. imagine you have the same number of examples for 2 classes. A random classifier predicting the majority class is expected to have an error rate of 50%, but with leave-one-out the majority class of the training set is always the opposite of the held-out instance's class, which produces a 100% error rate.
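The leave-one-out pitfall above can be reproduced in a few lines. A sketch, assuming scikit-learn's LeaveOneOut and DummyClassifier; the balanced two-class data set is made up for the demonstration.

```python
# Sketch of the leave-one-out pitfall: with two perfectly balanced classes, a
# majority-class predictor is wrong on every held-out instance, because the
# training fold's majority is always the opposite class.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import LeaveOneOut

X = np.zeros((20, 1))                # features are irrelevant here
y = np.array([0] * 10 + [1] * 10)    # 10 examples of each class

errors = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = DummyClassifier(strategy="most_frequent").fit(X[train_idx], y[train_idx])
    errors += int(clf.predict(X[test_idx])[0] != y[test_idx][0])

print("leave-one-out error rate:", errors / len(y))   # 1.0, i.e. 100%
```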

Accuracy measures

Accuracy measure. Example: the model of transmembrane helices. We have two models: (1) the loop model M_loop, given by the observed frequencies p of amino acids in loops; (2) the helix model M_helix, given by the observed frequencies q of amino acids in helices. Given a peptide s = x_1 ... x_N we can predict whether it is part of a helix or a loop using the log-likelihood test (assuming uniform priors and positional independence): S = log[ L(s|M_helix) / L(s|M_loop) ] = Σ_{i=1..N} log( q_{x_i} / p_{x_i} ). As a default, we can use as classification the rule: if S > 0 then s is part of a helix; if S ≤ 0 then s is a loop.
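To make the scoring rule concrete, here is a minimal sketch of the log-likelihood classifier in Python. It assumes we already have amino-acid frequency tables p (loops) and q (helices); the few frequency values shown are made up for illustration, not real estimates.

```python
# Sketch of the log-likelihood score S for a peptide (illustrative only).
# p and q are assumed amino-acid frequency tables; the values below are made up.
import math

p_loop  = {"A": 0.07, "L": 0.08, "K": 0.07, "V": 0.06}   # ... one entry per amino acid
q_helix = {"A": 0.09, "L": 0.12, "K": 0.04, "V": 0.08}   # ... one entry per amino acid

def loglik_score(peptide, q=q_helix, p=p_loop):
    """S = sum_i log(q[x_i] / p[x_i]), assuming positional independence."""
    return sum(math.log(q[x] / p[x]) for x in peptide)

s = "LLAVKA"
S = loglik_score(s)
print(S, "-> helix" if S > 0 else "-> loop")   # default rule: S > 0 means helix
```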

Accuracy measure. Example: the model of transmembrane helices. The score S = log[ L(s|M_helix) / L(s|M_loop) ] = Σ_{i=1..N} log( q_{x_i} / p_{x_i} ) is estimated from a training set. A test set is a set of labelled (annotated) proteins, with helix and loop regions, that we do not use for training.

Accuracy measure. [Figure: the test set is divided into Real (positive) and False (negative) cases; the shaded region marks the predictions (helix calls) made by our model.] Our model divides the test set according to our predictions of Real and False.

Accuracy measure. TP (True Positives): elements predicted as real that are real. TN (True Negatives): elements predicted as false that are false. FP (False Positives): elements predicted as real that are false. FN (False Negatives): elements predicted as false that are real.

Accuracy measure. True Positive Rate (Sensitivity), a.k.a. hit rate or recall: the proportion of true elements that are correctly predicted: Sn = TPR = TP / (TP + FN). False Positive Rate (FPR), a.k.a. fall-out: the proportion of negative cases that are mislabelled: FPR = FP / (FP + TN). Specificity: the proportion of negatives that are correctly predicted: Sp = 1 - FPR = TN / (FP + TN). Sn and Sp take values between 0 and 1. A perfect classification would have Sn = 1 and Sp = 1.

Accuracy measure. Positive Predictive Value (PPV), sometimes called Precision: the fraction of our predictions that are correct: PPV = TP / (TP + FP). False Discovery Rate (FDR): the fraction of our predictions that are wrong: FDR = FP / (FP + TP). PPV → 1 means most of our predictions are correct; FDR → 0 means that very few of our predictions are wrong.

Accuracy measure. The issue of True Negatives: sometimes we cannot find a True Negative set (e.g. for genomic features such as genes or regulatory regions it is very hard to find real negative cases). We can still use the TPR, PPV and FDR: TPR = TP / (TP + FN), PPV = TP / (TP + FP), FDR = FP / (FP + TP).

Accuracy measure. Overall success rate: the number of correct classifications divided by the total number of classifications (sometimes simply called accuracy): Overall Success Rate = (TP + TN) / (TP + TN + FN + FP). A value of 1 means that the model identifies all the positive and negative cases correctly. The error rate is 1 minus the overall success rate: Error Rate = 1 - (TP + TN) / (TP + TN + FN + FP).

Accuracy measure. Correlation coefficient, a.k.a. Matthews Correlation Coefficient (MCC): CC = (TP·TN - FP·FN) / sqrt( (TP + FN)(TN + FP)(TP + FP)(TN + FN) ). This measure scores correct predictions positively and incorrect ones negatively, and takes values between -1 and 1. The more correct the method, the closer CC is to 1; a very bad method will have a CC close to -1.
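The measures of the last few slides all follow from the four counts. A minimal sketch in plain Python; the counts in the example call are made up for illustration.

```python
# Sketch: the accuracy measures above computed from TP, FP, TN, FN.
# The counts in the example call are made up.
import math

def binary_metrics(TP, FP, TN, FN):
    return {
        "sensitivity (TPR)": TP / (TP + FN),
        "specificity":       TN / (TN + FP),
        "FPR":               FP / (FP + TN),
        "PPV (precision)":   TP / (TP + FP),
        "FDR":               FP / (FP + TP),
        "success rate":      (TP + TN) / (TP + TN + FP + FN),
        "MCC":               (TP * TN - FP * FN)
                             / math.sqrt((TP + FN) * (TN + FP) * (TP + FP) * (TN + FN)),
    }

for name, value in binary_metrics(TP=40, FP=10, TN=80, FN=20).items():
    print(f"{name}: {value:.3f}")
```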

Accuracy measure. This can also be represented by a confusion matrix for a 2-class prediction:

                Predicted yes       Predicted no
Actual yes      true positive       false negative
Actual no       false positive      true negative
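A small sketch of how the 2-class confusion matrix is filled from paired actual/predicted labels; the two label lists are made up for illustration and the "yes"/"no" label values are an assumption.

```python
# Sketch: filling the 2-class confusion matrix from actual vs. predicted labels.
actual    = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no", "yes", "no"]

TP = sum(a == "yes" and p == "yes" for a, p in zip(actual, predicted))
FN = sum(a == "yes" and p == "no"  for a, p in zip(actual, predicted))
FP = sum(a == "no"  and p == "yes" for a, p in zip(actual, predicted))
TN = sum(a == "no"  and p == "no"  for a, p in zip(actual, predicted))

print("            predicted yes  predicted no")
print(f"actual yes  {TP:^13}  {FN:^12}")
print(f"actual no   {FP:^13}  {TN:^12}")
```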

Accuracy measure. For multiclass predictions, the confusion matrix has one row per actual class and one column per predicted class. Good results correspond to large numbers on the diagonal and small numbers off the diagonal.

(a) Observed values:
                Predicted a   Predicted b   Predicted c   Total
Actual a             88            10             2        100
Actual b             14            40             6         60
Actual c             18            10            12         40
Total               120            60            20

In this example we have 200 instances (100 + 60 + 40) and 140 of them (88 + 40 + 12) are predicted correctly, so the success rate is 70%. Question: is this a good measure? How many agreements do we expect by chance?

To answer this we build the matrix of expected values, keeping the same totals and sharing out the total of each class. The totals of the actual (real) classes are a = 100, b = 60, c = 40. We split each of them into the three groups using the proportions of the predicted-class totals, a = 120, b = 60, c = 20, i.e. a = 60%, b = 30%, c = 10%.

(b) Expected values:
                Predicted a   Predicted b   Predicted c   Total
Actual a             60            30            10        100
Actual b             36            18             6         60
Actual c             24            12             4         40
Total               120            60            20

To estimate the relative agreement between observed and expected values we can use the kappa statistic: κ = ( P(A) - P(E) ) / ( 1 - P(E) ) = ( n(A) - n(E) ) / ( N - n(E) ) = (140 - 82) / (200 - 82) ≈ 0.49, where P(A) is the probability of agreement and P(E) is the probability of agreement by chance; n(A) = 140 and n(E) = 60 + 18 + 4 = 82 are the corresponding diagonal counts. The maximum possible value is κ = 1, and a random predictor gives κ = 0.
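The kappa calculation for the 3-class example above can be reproduced directly from the observed matrix; a sketch in plain Python.

```python
# Sketch: kappa statistic for the 3-class confusion matrix above.
observed = [
    [88, 10,  2],   # actual a
    [14, 40,  6],   # actual b
    [18, 10, 12],   # actual c
]

N = sum(sum(row) for row in observed)                       # 200 instances
row_totals = [sum(row) for row in observed]                 # actual-class totals
col_totals = [sum(col) for col in zip(*observed)]           # predicted-class totals

n_agree    = sum(observed[i][i] for i in range(3))          # 140 on the diagonal
n_expected = sum(row_totals[i] * col_totals[i] / N for i in range(3))   # 82

kappa = (n_agree - n_expected) / (N - n_expected)
print(round(kappa, 2))   # 0.49
```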

Accuracy measure. What is a good accuracy? Every measure shows a different perspective on the performance of the model, so in general we use two or more complementary measures to evaluate a model. E.g. a method that finds almost all true elements will have Sn close to 1, but this can be achieved with very low Sp; conversely, a method with Sp close to 1 may have very low Sn. In general, one would like a method that balances Sn and Sp (or equivalent measures).

Accuracy measure. What is a good accuracy? Which accuracy measure we want to maximize often depends on the question. Do you want to find all the true cases? Then you want higher sensitivity. Do you want to report only correct cases? Then you want higher specificity. Question: does predicting novel genes require high Sp, or perhaps high Sn?

Choosing a prediction threshold

Accuracy measure. Although we have one single model, in fact we have a family of predictions, defined by one or more parameters, e.g. the threshold λ of the log-likelihood test: predict a helix when S = log[ L(s|M_helix) / L(s|M_loop) ] > λ. [Figure: as the threshold λ slides along the score axis, the split of the Real and False cases into TP, FP, TN and FN changes.]

Receiver Operating Characteristic (ROC) curve. A ROC curve is a graphical plot of TPR (Sn) vs. FPR obtained for the same prediction model by varying one or more of the model parameters. It is quite common for binary classifiers. For instance, it can be plotted for several values of the discrimination threshold, but other parameters of the model can be used.

Receiver Operating Characteristic (ROC) curve. [Figure: distributions of the scores in the negative and positive cases; the area above the threshold criterion contains the positive predictions, with the true negatives and false negatives below it. Threshold A: low TPR, low FPR; threshold C: high TPR, high FPR; B lies in between.] Each threshold yields one point with TPR = TP / (TP + FN) and FPR = FP / (FP + TN); plotting these points gives the model's classification curve, which is compared to the random-classification diagonal from (0,0) to (1,1).

Receiver Operating Characteristic (ROC) curve. Each dot on the curve corresponds to a choice of parameters (usually a single parameter). The information that is not visible in this graph is the threshold used at each point. The x = y line corresponds to random classification, i.e. choosing positive or negative at every threshold with 50% chance. TPR = TP / (TP + FN), FPR = FP / (FP + TN).

Receiver Operating Characteristic (ROC) curve. Example: consider the ranking of scores S = log[ L(s|M_helix) / L(s|M_loop) ] on a labelled test set (R = real/positive, F = false/negative):

S       Known label
10      R
7       R
4       R
2       F
1       R
-0.4    R
-2      F
-5      F
-9      F

Receiver Operating Characteristic (ROC) curve. Let's choose a cut-off λ for prediction, i.e. above this value we predict R, and calculate TP, FP, TN, FN, TPR = TP / (TP + FN) and FPR = FP / (FP + TN). Starting with λ = 3 and repeating for other λ's (note: arbitrary intermediate values are used for the cut-off):

λ      TP   FP   TN   FN   TPR   FPR
3      3    0    4    2    3/5   0
0      4    1    3    1    4/5   1/4
-7     5    3    1    0    1     3/4

Receiver Operating Characteristic (ROC) curve. Exercise: complete the table for the remaining cut-offs. You should see that for smaller cut-offs the TPR (sensitivity) increases but the FPR increases as well (i.e. the specificity drops), whereas for high cut-offs the TPR decreases but the FPR is low (the specificity is high). The variability of the accuracy as a function of the parameters and/or cut-offs is what a ROC curve describes.
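The rows of the table can be reproduced with a short sketch in plain Python, using the scores and labels of the example above.

```python
# Sketch: TPR and FPR at different cut-offs for the example ranking above.
scores = [10, 7, 4, 2, 1, -0.4, -2, -5, -9]
labels = ["R", "R", "R", "F", "R", "R", "F", "F", "F"]   # R = real, F = false

def roc_point(cutoff):
    TP = sum(s > cutoff and l == "R" for s, l in zip(scores, labels))
    FP = sum(s > cutoff and l == "F" for s, l in zip(scores, labels))
    FN = sum(s <= cutoff and l == "R" for s, l in zip(scores, labels))
    TN = sum(s <= cutoff and l == "F" for s, l in zip(scores, labels))
    return FP / (FP + TN), TP / (TP + FN)    # (FPR, TPR)

for cutoff in (3, 0, -7):
    fpr, tpr = roc_point(cutoff)
    print(f"lambda={cutoff:>3}: TPR={tpr:.2f}  FPR={fpr:.2f}")
# Scanning a fine grid of cut-offs and plotting the (FPR, TPR) pairs traces the ROC curve.
```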

Receiver Operating Characteristic (ROC) curve. Comparing multiple methods: each ROC curve corresponds to a different method. Better models lie further from the x = y line (random classification). (See e.g. Corvelo et al. PLoS Comp. Biology 2010.)

Receiver Operating Characteristic (ROC) curve. Example: if you wish to discover at least 60% of the true elements (TPR = 0.6), the graph shows that Method 1 has a lower FPR than Methods 2 and 3, so we may choose Method 1. We would then make predictions with Method 1, choosing parameters that produce FPR = 0.2 at TPR = 0.6. But is this the best choice?

Receiver Operating Characteristic (ROC) curve. Optimal configuration: the more distant a point is from the diagonal (the line TPR = FPR), the better the classification at that threshold. An optimal choice of operating point is the one at maximum distance from the TPR = FPR line, and there are standard methods to calculate it. But again: this is optimal for the balance of TPR and FPR, and it might not be the most appropriate choice for the problem at hand, e.g. predicting novel genes.

Receiver Operating Characteristic (ROC) curve. A summary measure for comparing models is the Area Under the Curve (AUC). The best model will in general have the highest AUC; the maximum value is AUC = 1, and the closer the AUC is to one, the better the model. There are also standard methods to estimate the AUC from the sampled points of the curve.
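One common way to estimate the AUC from sampled points is the trapezoidal rule. A sketch, using the (FPR, TPR) points from the worked example above plus the trivial (0,0) and (1,1) end points.

```python
# Sketch: estimating the AUC by the trapezoidal rule over sampled (FPR, TPR) points.
points = [(0.0, 0.0), (0.0, 0.6), (0.25, 0.8), (0.75, 1.0), (1.0, 1.0)]

points.sort()                       # points must be ordered by increasing FPR
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(round(auc, 3))                # 0.875 for these sampled points
```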

Receiver Operating Characteristic (ROC) curve. Question: why do you think there are error bars in the AUC barplot and in the ROC curves of the different methods?

Precision recall curves. ROC curves are useful to compare predictive models, but they still do not provide a complete picture of their accuracy. If we predict many TPs at the cost of producing many false predictions (FP is large), the FPR might not look so bad if the test set contains many negatives, such that TN >> FP: FPR = FP / (FP + TN) → 0 when TN is large. So we may have a situation where the TPR is high and the FPR is low, but where in the actual counts FP >> TP. That is, the TPR is not affected by FP, and the FPR can be low even if FP is high (as long as TN >> FP).

Precision recall curves. For instance, consider methods to classify documents. Suppose the first method selects 100 documents, of which 40 are correct, and that the test set is composed of 100 true instances and 10,000 negative instances: TPR_1 = TP / (TP + FN) = 40/100 = 0.4, FPR_1 = FP / (FP + TN) = 60/10000 = 0.006. Now consider a second method that selects 680 documents with 80 correct, and a test set composed of 100 true instances and 100,000 negative instances: TPR_2 = 80/100 = 0.8, FPR_2 = 600/100000 = 0.006. Which method is better?

Precision recall curves. The second one may seem better, because it retrieves more relevant documents, but the proportion of its predictions that are correct (the precision, or PPV = TP / (TP + FP)) is smaller: Precision_1 = 40/100 = 0.40, Precision_2 = 80/680 ≈ 0.12. (Note: you can also use FDR = 1 - PPV.) Thus one must also take into account the relative cost of the predictions, i.e. the FN and FP counts that must be accepted to achieve a high TPR. One can make TN arbitrarily large to drive the FPR → 0, so other accuracy measures are needed to get a more correct picture.
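A sketch that reproduces the comparison of the two hypothetical document classifiers; the counts are the made-up numbers from the example above.

```python
# Sketch: comparing the two hypothetical document classifiers by FPR vs. precision.
def summarize(name, selected, correct, n_true, n_negative):
    TP, FP = correct, selected - correct
    FN, TN = n_true - correct, n_negative - (selected - correct)
    tpr = TP / (TP + FN)
    fpr = FP / (FP + TN)
    precision = TP / (TP + FP)
    print(f"{name}: TPR={tpr:.2f}  FPR={fpr:.4f}  precision={precision:.2f}")

summarize("method 1", selected=100, correct=40, n_true=100, n_negative=10_000)
summarize("method 2", selected=680, correct=80, n_true=100, n_negative=100_000)
# Both methods have FPR = 0.006, but method 2's precision is much lower.
```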

Precision recall curves. Precision: the proportion of the predictions that are correct: precision = PPV = TP / (TP + FP). Recall: the proportion of the true instances that are correctly recovered: recall = TPR = TP / (TP + FN). A precision-recall curve plots precision vs. recall as the threshold (or another parameter) varies. (See e.g. Plass et al. RNA 2012.)

Precision recall curves. Model 1 has the greater ROC AUC, but low precision (a high cost in false positives). Model 2 achieves a lower AUC than Model 1, but still quite good, and its precision is highly improved.

References. Data Mining: Practical Machine Learning Tools and Techniques. Ian H. Witten, Eibe Frank, Mark A. Hall. Morgan Kaufmann. ISBN 978-0-12-374856-0. http://www.cs.waikato.ac.nz/ml/weka/book.html. Methods for Computational Gene Prediction. W.H. Majoros. Cambridge University Press, 2007.