Model Accuracy Measures Master in Bioinformatics UPF 2017-2018 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain
Variables: what we can measure (attributes)
Hypotheses: what we want to predict (class values/labels)
Examples: training set (labeled data)
Training: build the model from the training set
Predict on new cases:
Prediction: does the example belong to this model?
Classification: what is the most probable label?
Testing the accuracy of a model Is my method good enough? (for the specific problem) How does my method compare to other methods?
Testing the accuracy of a model
We need a systematic way to evaluate and compare multiple methods.
Methods are heterogeneous in their purposes, e.g.:
1) Ability to classify instances accurately
2) Predicting/scoring the class labels
3) Methods may predict numerical or nominal values (score, class label, yes/no, posterior probability, etc.)
Thus we need a methodology that is applicable to all of them.
Training and Testing
Accuracy: the expected performance of the model on future (new) data.
It is wrong to estimate the accuracy on the same dataset used to build (train) the model. This estimate would be overly optimistic.
Overfitting → the model won't necessarily adapt well to new, different instances.
Training and Testing
Separate the known cases into a training set and a test set.
[Figure: labeled cases are split into cases for training (training step → model) and cases for testing (evaluation step)]
On the cases for testing we predict and compare the predictions with the known labels.
How to do the splitting? A common splitting choice is 2/3 for training and 1/3 for testing. This approach is suitable when the entire dataset is large.
Training and Testing
How to select the data for training and testing:
1) Stratification: the size of each of the prediction classes should be similar in each subset, training and testing (balanced subsets).
2) Homogeneity: data sets should have similar properties to have a reliable test, e.g. GC content, peptide lengths, species represented (e.g. would you test a model of human transmembrane domains with yeast proteins?).
These conditions ensure that the different properties and prediction classes are represented in both subsets.
Provided that the sets are balanced and homogeneous, the accuracy on the test set will be a good estimate of future performance.
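The 2/3 vs. 1/3 split with stratification can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `stratified_split`, the fixed random seed and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=1/3, seed=0):
    """Return (train_idx, test_idx), keeping class proportions similar in both subsets."""
    rng = random.Random(seed)
    # Group the instance indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])    # 1/3 of each class for testing
        train_idx.extend(indices[n_test:])   # 2/3 of each class for training
    return sorted(train_idx), sorted(test_idx)

# Toy labels (hypothetical): 60 helix peptides and 30 loop peptides.
labels = ["helix"] * 60 + ["loop"] * 30
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Stratification only balances the class labels. Homogeneity (GC content, lengths, species) still has to be checked separately for the data at hand.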
Training and Testing: N-fold cross validation
Test set: 1/N parts of the data set
Training set: (N-1)/N parts of the data set
Build a predictive model on the training set and measure Accuracy 1 on the test set,
where accuracy is used generically: any measure of prediction performance.
Training and Testing: N-fold cross validation
[Figure: the data set is split N times into a test set and a training set, giving Accuracy 1, Accuracy 2, Accuracy 3, ..., Accuracy N]
Average accuracy: the average of the N accuracies reflects the performance of the model on the entire dataset.
Important: subsets must be representative of the original data (stratification and homogeneity).
The standard is to do 10-fold cross validation.
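A rough sketch of the N-fold procedure is shown below. The nearest-class-mean "model", the success-rate measure and the toy data are all assumptions chosen to keep the example short; any training function and any accuracy measure could be plugged in. The folds here are random and not stratified.

```python
import random

def n_fold_indices(n_items, n_folds, seed=0):
    """Shuffle the indices and deal them into n_folds disjoint subsets."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[k::n_folds] for k in range(n_folds)]

def cross_validate(xs, ys, n_folds, train, evaluate):
    """Train on N-1 folds, test on the remaining one, and average the accuracies."""
    folds = n_fold_indices(len(xs), n_folds)
    accuracies = []
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        model = train([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        accuracies.append(evaluate(model, [xs[i] for i in test_idx],
                                   [ys[i] for i in test_idx]))
    return accuracies, sum(accuracies) / len(accuracies)

def train_nearest_mean(xs, ys):
    """Toy model (assumption): store the mean value of each class."""
    means = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(values) / len(values)
    return means

def success_rate(model, xs, ys):
    """Fraction of test cases whose nearest class mean matches the known label."""
    predictions = [min(model, key=lambda label: abs(x - model[label])) for x in xs]
    return sum(p == y for p, y in zip(predictions, ys)) / len(ys)

# Hypothetical, well-separated 1-D data for two classes.
xs = [float(i) for i in range(20)] + [float(i) for i in range(100, 120)]
ys = ["loop"] * 20 + ["helix"] * 20
accuracies, average = cross_validate(xs, ys, 10, train_nearest_mean, success_rate)
print(accuracies, average)
```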
Training and Testing: Leave-one-out
It is like n-fold cross validation, but where n is the size of the set (number of instances), that is: train on all but one instance, test on that one.
Advantages:
1) The greatest possible amount of data is used for training (n-1 instances)
2) It is deterministic: no random sampling of subsets is involved
Disadvantages:
1) Computationally more expensive
2) It cannot be stratified
E.g. imagine you have the same number of examples for 2 classes. A classifier that always predicts the majority class is expected to have an error rate of 50%, but in the leave-one-out method the majority class in the training set is always the opposite of the class of the held-out instance, which produces a 100% error rate.
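The stratification problem of leave-one-out can be checked with a few lines of code. This sketch assumes a majority-class predictor and a balanced toy set of 10 + 10 labels; both are illustrative choices.

```python
from collections import Counter

def leave_one_out_error(labels):
    """Error rate of a majority-class predictor evaluated with leave-one-out."""
    errors = 0
    for i, held_out in enumerate(labels):
        training = labels[:i] + labels[i + 1:]          # all instances but one
        majority = Counter(training).most_common(1)[0][0]
        errors += majority != held_out                  # predict the majority class
    return errors / len(labels)

# Balanced toy set (hypothetical): removing one instance always makes
# the other class the majority in the training set.
labels = ["helix"] * 10 + ["loop"] * 10
error_rate = leave_one_out_error(labels)
print(error_rate)
```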
Accuracy measures
Accuracy measure
Example: the model of transmembrane helices
We have two models:
(1) the loop model $M_{loop}$, given by the observed frequencies $p$ of amino acids (AA) in loops
(2) the helix model $M_{helix}$, given by the observed frequencies $q$ of AA in helices
Given a peptide $s = x_1 \ldots x_N$ we can predict whether it is part of a helix or a loop using the log-likelihood test (assuming uniform priors and positional independence):
$$S = \log \frac{L(s \mid M_{helix})}{L(s \mid M_{loop})} = \log \frac{\prod_{i=1}^{N} q_{x_i}}{\prod_{i=1}^{N} p_{x_i}}$$
As a default, we can use the classification rule:
if $S > 0$ then $s$ is part of a helix
if $S \le 0$ then $s$ is a loop
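As an illustration of the log-likelihood test, the score can be computed as a sum of per-residue log ratios, which equals the log of the ratio of products. The frequency tables below are invented for a toy 4-letter alphabet and are not real amino-acid frequencies; the function names are also assumptions.

```python
import math

# Hypothetical frequencies over a toy 4-letter alphabet (not real data).
q_helix = {"L": 0.4, "A": 0.3, "K": 0.1, "D": 0.2}   # frequencies q in helices
p_loop = {"L": 0.1, "A": 0.2, "K": 0.4, "D": 0.3}    # frequencies p in loops

def log_likelihood_ratio(peptide, q=q_helix, p=p_loop):
    """S = sum_i log(q_{x_i} / p_{x_i}), assuming positional independence."""
    return sum(math.log(q[x] / p[x]) for x in peptide)

def classify(peptide, threshold=0.0):
    """Default rule: helix if S > threshold, loop otherwise."""
    return "helix" if log_likelihood_ratio(peptide) > threshold else "loop"

print(classify("LLAL"), classify("KKDK"))
```

Working with a sum of logs rather than a ratio of products avoids numerical underflow for long peptides.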
Accuracy measure
Example: the model of transmembrane helices
Training set → $S = \log \frac{L(s \mid M_{helix})}{L(s \mid M_{loop})} = \log \frac{\prod_{i=1}^{N} q_{x_i}}{\prod_{i=1}^{N} p_{x_i}}$
A test set: a set of labelled (annotated) proteins (helix, loop) that we do not use for training.
Accuracy measure
[Figure: the test set divided into Real and False elements]
Accuracy measure
Our model divides the test set according to our predictions of Real and False.
[Figure: the test set (Real, False) with our predictions overlaid; the red area contains the predictions (helix) made by our model]
Accuracy measure
TP (True Positives): elements predicted as real that are real
TN (True Negatives): elements predicted as false that are false
FP (False Positives): elements predicted as real that are false
FN (False Negatives): elements predicted as false that are real
[Figure: the test set (Real, False) partitioned into TP, FN, FP and TN]
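The four counts can be obtained by comparing known labels with predictions, as in this small sketch. The label encoding ("R" for real, "F" for false) and the toy lists are assumptions for the example.

```python
def confusion_counts(actual, predicted, positive="R"):
    """Count TP, TN, FP and FN from paired known labels and predictions."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1   # predicted as real, is real
            else:
                fp += 1   # predicted as real, is false
        else:
            if a == positive:
                fn += 1   # predicted as false, is real
            else:
                tn += 1   # predicted as false, is false
    return tp, tn, fp, fn

# Hypothetical known labels and predictions for 8 test cases.
actual = ["R", "R", "R", "F", "F", "F", "F", "R"]
predicted = ["R", "R", "F", "R", "F", "F", "F", "R"]
counts = confusion_counts(actual, predicted)
print(counts)
```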
Accuracy measure
True Positive Rate (Sensitivity): proportion of true elements that are correctly predicted (a.k.a. hit rate, recall)
$$Sn = TPR = \frac{TP}{TP + FN}$$
False Positive Rate (FPR): proportion of negative cases that are mislabelled (a.k.a. fall-out)
$$FPR = \frac{FP}{FP + TN}$$
Specificity: proportion of the negatives that are correctly predicted
$$Sp = 1 - FPR = \frac{TN}{FP + TN}$$
Sn and Sp take values between 0 and 1. A perfect classification would have Sn = 1 and Sp = 1.
Accuracy measure
Positive Predictive Value (PPV): sometimes called Precision, it gives the fraction of our predictions that are correct
$$PPV = \frac{TP}{TP + FP}$$
False Discovery Rate (FDR): the fraction of our predictions that are wrong
$$FDR = \frac{FP}{FP + TP}$$
PPV → 1 means that most of our predictions are correct.
FDR → 0 means that very few of our predictions are wrong.
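The rates defined so far follow directly from the four counts. The sketch below uses made-up counts purely for illustration; the function name `rates` is an assumption.

```python
def rates(tp, tn, fp, fn):
    """Return Sn (TPR), FPR, Sp, PPV and FDR computed from the four counts."""
    return {
        "Sn": tp / (tp + fn),    # sensitivity, recall, hit rate
        "FPR": fp / (fp + tn),   # fall-out
        "Sp": tn / (fp + tn),    # specificity = 1 - FPR
        "PPV": tp / (tp + fp),   # precision
        "FDR": fp / (fp + tp),   # = 1 - PPV
    }

# Hypothetical counts, chosen only for illustration.
example = rates(tp=40, tn=45, fp=5, fn=10)
for name, value in example.items():
    print(name, round(value, 3))
```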
Accuracy measure: the issue of True Negatives
Sometimes we cannot find a True Negative set (e.g. think of genomic features like genes, regulatory regions, etc.: it is very hard to find real negative cases for some biological features).
We can still use the TPR, PPV and FDR, which do not depend on TN:
$$TPR = \frac{TP}{TP + FN} \qquad PPV = \frac{TP}{FP + TP} \qquad FDR = \frac{FP}{FP + TP}$$
Accuracy measure
Overall success rate: the number of correct classifications divided by the total number of classifications (sometimes called "accuracy"):
$$\text{Overall Success Rate} = \frac{TP + TN}{TP + TN + FN + FP}$$
A value of 1 for the success rate means that the model identifies all the positive and negative cases correctly.
The error rate: 1 minus the overall success rate:
$$\text{Error Rate} = 1 - \frac{TP + TN}{TP + TN + FN + FP}$$
Accuracy measure
Correlation coefficient (a.k.a. Matthews Correlation Coefficient, MCC):
$$CC = \frac{(TP)(TN) - (FP)(FN)}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}$$
This measure scores correct predictions positively and incorrect ones negatively, and takes values between -1 and 1.
The more correct the method, the closer CC is to 1 (CC → 1).
A very bad method will have a CC closer to -1.
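A short sketch of the overall success rate and the correlation coefficient is given below. Returning 0 when the denominator vanishes is a common convention and is an assumption here, since the slides do not cover that case; the counts are invented.

```python
import math

def success_rate(tp, tn, fp, fn):
    """Correct classifications divided by all classifications."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient computed from the four counts."""
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumption: convention used when a row or column total is zero
    return (tp * tn - fp * fn) / denominator

# Hypothetical counts, chosen only for illustration.
print(success_rate(40, 45, 5, 10), round(mcc(40, 45, 5, 10), 3))
print(mcc(50, 50, 0, 0), mcc(0, 0, 50, 50))  # perfect and fully inverted predictions
```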
Accuracy measure
This can also be represented by a confusion matrix for a 2-class prediction:
                      Predicted class
                      yes               no
Actual class   yes    true positive     false negative
               no     false positive    true negative
For multiclass predictions: Accuracy measure
(a) Observed values
              Predicted class
               a     b     c   Total
Actual   a    88    10     2    100
class    b    14    40     6     60
         c    18    10    12     40
Total        120    60    20    200
(b) Expected values
              Predicted class
               a     b     c   Total
Actual   a    60    30    10    100
class    b    36    18     6     60
         c    24    12     4     40
Total        120    60    20    200
Good results correspond to large numbers on the diagonal and small numbers off the diagonal.
In the example we have 200 instances (100 + 60 + 40) and 140 of them are predicted correctly, thus the success rate is 70%.
Question: is this a good measure? How many agreements do we expect by chance?
For multiclass predictions: Accuracy measure
(a) Observed values, (b) Expected values
We build the matrix of expected values by using the same totals as before and sharing out the total of each class:
Totals in each actual (Real) class: a = 100, b = 60, c = 40
We split each of them into the three groups using the proportions of the predicted classes: a = 120, b = 60, c = 20 → a = 60%, b = 30%, c = 10%
For multiclass predictions: Accuracy measure
Sum of the diagonal (agreements): (a) observed values n(a) = 140, (b) expected values n(e) = 82, out of N = 200 instances.
To estimate the relative agreement between observed and expected values we can use the kappa statistic:
$$\kappa = \frac{P(A) - P(E)}{1 - P(E)} = \frac{n(a) - n(e)}{N - n(e)} = \frac{140 - 82}{200 - 82} = 0.49$$
where P(A) is the probability of agreement and P(E) is the probability of agreement by chance.
The maximum possible value is κ = 1, and for a random predictor κ = 0.
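The expected matrix and the kappa statistic for this example can be reproduced with a few lines. The function names are assumptions; the observed counts are the ones from the slide.

```python
def expected_matrix(observed):
    """Share each actual-class total according to the predicted-class proportions."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    return [[r * c / n for c in col_totals] for r in row_totals]

def kappa(observed):
    """kappa = (n(a) - n(e)) / (N - n(e)), using the diagonals of both matrices."""
    n = sum(sum(row) for row in observed)
    expected = expected_matrix(observed)
    n_a = sum(observed[i][i] for i in range(len(observed)))   # observed agreements
    n_e = sum(expected[i][i] for i in range(len(expected)))   # agreements by chance
    return (n_a - n_e) / (n - n_e)

# Observed confusion matrix from the slide (rows: actual a, b, c).
observed = [[88, 10, 2], [14, 40, 6], [18, 10, 12]]
print(expected_matrix(observed))
print(round(kappa(observed), 2))
```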
Accuracy measure What is a good accuracy? Every measure shows a different perspective on the performance of the model. In general we will use two or more complementary measures to evaluate a model. E.g. a method that finds almost all elements will have an Sn close to 1, but this can be achieved with a method with very low Sp E.g. a method that has Sp close to 1, may have very low Sn In general, one would like to have a method that balances Sn and Sp (or equivalent measures)
Accuracy measure
What is a good accuracy?
Which accuracy measure we want to maximize often depends on the question:
Do you want to find all the true cases? (You want higher sensitivity)
Or do you want to find only correct cases? (You want higher specificity)
Question: does predicting novel genes require high Sp, or perhaps high Sn?
Choosing a prediction threshold
Accuracy measure
Although we have one single model, in fact we have a family of predictions, which are defined by one or more parameters, e.g. the threshold λ in the log-likelihood test:
$$S = \log \frac{L(s \mid M_{helix})}{L(s \mid M_{loop})} > \lambda$$
[Figure: different values of λ divide the test set (Real, False) into different TP, FP, TN and FN regions]
Receiver Operating Characteristic (ROC) curve
A ROC curve is a graphical plot of TPR (Sn) vs. FPR built for the same prediction model by varying one or more of the model parameters.
It is quite common for binary classifiers.
For instance, it can be plotted for several values of the discrimination threshold, but other parameters of the model can be used.
[Figure: each value of λ applied to the test set (Real, False) gives one point in the TPR vs. FPR plot]
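Receiver Operating Characteristic (ROC) curve
[Figure: distribution of the scores in negative cases and in positive cases, with a threshold criterion. Example for threshold B: the area beyond the threshold contains our positive predictions; the True Negative and False Negative areas are marked. On the ROC plot (TPR vs. FPR), threshold A gives low TPR and low FPR, and threshold C gives high TPR and high FPR.]
$$TPR = \frac{TP}{TP + FN} \qquad FPR = \frac{FP}{FP + TN}$$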
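Receiver Operating Characteristic (ROC) curve
[Figure: distribution of the scores in negative and in positive cases, with a threshold criterion and the True Negative and False Negative areas marked; next to it, the ROC plot of TPR (0 to 1) vs. FPR (0 to 1) showing the model classification curve and the random classification diagonal]
$$TPR = \frac{TP}{TP + FN} \qquad FPR = \frac{FP}{FP + TN}$$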
Receiver Operating Characteristic (ROC) curve
Each dot in the line corresponds to a choice of parameters (usually one single parameter).
The information that is not visible in this graph is the threshold used at each point of the graph.
The x = y line corresponds to the random classification, i.e. choosing positive or negative at every threshold with 50% chance.
[Figure: ROC plot of TPR (0 to 1) vs. FPR (0 to 1) showing the model classification curve and the random classification diagonal]
$$TPR = \frac{TP}{TP + FN} \qquad FPR = \frac{FP}{FP + TN}$$
Receiver Operating Characteristic (ROC) curve
Example: consider the ranking of scores $S = \log \frac{L(s \mid M_{helix})}{L(s \mid M_{loop})}$. The test set is labeled (R = Real, F = False):
S       Known label
10      R
7       R
4       R
2       F
1       R
-0.4    R
-2      F
-5      F
-9      F
Receiver Operating Characteristic (ROC) curve
Example: consider the ranking of scores (S = 10, 7, 4, 2, 1, -0.4, -2, -5, -9 with known labels R, R, R, F, R, R, F, F, F).
Let's choose a cut-off (a λ), e.g. λ = 3: above this value we predict R.
Calculate TP, FP, TN, FN, TPR and FPR for this λ, and repeat for other values of λ:
λ     TP   FP   TN   FN   TPR   FPR
3     3    0    4    2    3/5   0
0     4    1    3    1    4/5   1/4
-7    5    3    1    0    1     3/4
Note: I'm using arbitrary intermediate values for the cut-off.
Exercise: complete the table.
You should see that for smaller cut-offs the TPR (sensitivity) increases but the FPR increases as well (i.e. the specificity drops), whereas for high cut-offs the TPR decreases but the FPR is low (specificity is high).
The variability of the accuracy as a function of the parameters and/or cut-offs is generally described with a ROC curve.
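The rows of the table can be checked with a small script. This is only a sketch: the function name `roc_row` is an assumption, and exact fractions are used so that the output matches the table. Other cut-offs can be passed to the same function to complete the exercise.

```python
from fractions import Fraction

# Scores and known labels from the example (R = real, F = false).
scores = [10, 7, 4, 2, 1, -0.4, -2, -5, -9]
labels = ["R", "R", "R", "F", "R", "R", "F", "F", "F"]

def roc_row(cutoff):
    """Return (TP, FP, TN, FN, TPR, FPR) when scores above the cut-off are predicted R."""
    tp = sum(s > cutoff and y == "R" for s, y in zip(scores, labels))
    fp = sum(s > cutoff and y == "F" for s, y in zip(scores, labels))
    fn = labels.count("R") - tp
    tn = labels.count("F") - fp
    return tp, fp, tn, fn, Fraction(tp, tp + fn), Fraction(fp, fp + tn)

for cutoff in (3, 0, -7):
    print(cutoff, *roc_row(cutoff))
```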
Receiver Operating Characteristic (ROC) curve
Comparing multiple methods
Each line corresponds to a different method.
Better models are further from the x = y line (random classification).
[Figure: ROC curves for Method 1, Method 2 and Method 3, with the random classification diagonal]
(see e.g. Corvelo et al. PLOS Comp. Biology 2010)
Receiver Operating Characteristic (ROC) curve
Example: if you wish to discover at least 60% of the true elements (TPR = 0.6), the graph says that Model 1 has a lower FPR than Models 2 and 3.
We may want to choose Model 1. We would then decide to make predictions with Model 1 and choose parameters that produce FPR = 0.2 at TPR = 0.6.
But is this the best choice?
[Figure: ROC curves for Method 1, Method 2 and Method 3, with the random classification diagonal]
Receiver Operating Characteristic (ROC) curve
Optimal configuration
Note that the more distant the points are from the diagonal (the line TPR = FPR), the better the classification.
An optimal choice for a dot in the curve is the one that is at a maximum distance from the TPR = FPR line. There are standard methods to calculate this point.
But again: this is optimal for the balance of TPR and FPR, but it might not be the most appropriate for the model at hand, e.g. predicting novel genes.
[Figure: ROC curves for Method 1, Method 2 and Method 3]
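One way to pick the point furthest from the diagonal is sketched below, reusing the ranking of scores from the earlier ROC example. For a point above the diagonal, its distance to the line TPR = FPR is (TPR - FPR)/√2, so maximizing the distance is the same as maximizing TPR - FPR. The list of candidate cut-offs (including -1) is an arbitrary choice for illustration.

```python
import math
from fractions import Fraction

# Scores and known labels from the ROC example (R = real, F = false).
scores = [10, 7, 4, 2, 1, -0.4, -2, -5, -9]
labels = ["R", "R", "R", "F", "R", "R", "F", "F", "F"]

def roc_point(cutoff):
    """Return (FPR, TPR) when everything scoring above the cut-off is predicted R."""
    tp = sum(s > cutoff and y == "R" for s, y in zip(scores, labels))
    fp = sum(s > cutoff and y == "F" for s, y in zip(scores, labels))
    return Fraction(fp, labels.count("F")), Fraction(tp, labels.count("R"))

def distance_to_diagonal(fpr, tpr):
    """Signed distance from the point (FPR, TPR) to the line TPR = FPR."""
    return float(tpr - fpr) / math.sqrt(2)

cutoffs = [3, 0, -1, -7]   # arbitrary candidate cut-offs (assumption)
best_cutoff = max(cutoffs, key=lambda c: distance_to_diagonal(*roc_point(c)))
print(best_cutoff, roc_point(best_cutoff))
```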
Receiver Operating Characteristic (ROC) curve
[Figure: ROC curves for Method 1, Method 2 and Method 3, and a barplot of the AUC of each model]
A summary measure for the best model is the Area Under the Curve (AUC).
The best model in general will have the highest AUC.
The maximum value is AUC = 1. The closer the AUC is to one, the better the model.
There are also standard methods to estimate the AUC from the sampled points.
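One standard way to estimate the AUC from sampled ROC points is the trapezoidal rule, sketched here on the ranking of scores from the earlier example. The sketch assumes there are no tied scores; with ties, the tied cases would have to be added to the curve as a single step.

```python
# Scores and known labels from the ROC example (R = real, F = false).
scores = [10, 7, 4, 2, 1, -0.4, -2, -5, -9]
labels = ["R", "R", "R", "F", "R", "R", "F", "F", "F"]

def roc_points(scores, labels, positive="R"):
    """Walk down the ranking and collect (FPR, TPR) points, starting at (0, 0)."""
    n_pos = sum(y == positive for y in labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Trapezoidal rule over consecutive (FPR, TPR) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

points = roc_points(scores, labels)
print(round(auc(points), 3))
```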
Receiver Operating Characteristic (ROC) curve
[Figure: ROC curves for Method 1, Method 2 and Method 3, and a barplot of the AUC of each model, both with error bars]
Question: why do you think there are error bars in the AUC barplot and in the ROC curves?
Precision recall curves
ROC curves are useful to compare predictive models. However, they still do not provide a complete picture of the accuracy of models.
If we predict many TPs at the cost of producing many false predictions (FP is large), the FPR might not look so bad if in our testing set we have many Negatives, such that TN >> FP:
$$FPR = \frac{FP}{FP + TN} \to 0 \quad \text{when TN is large}$$
So we may have a situation where our TPR is high and the FPR is low, but where for the actual counts FP >> TP.
That is, TPR is not affected by FP, and FPR can be low even if FP is high (as long as TN >> FP).
Precision recall curves
For instance, consider a method to classify documents. Let's suppose that the first method selects 100 documents, of which 40 are correct. Imagine that our test set is composed of 100 True instances and 10000 Negative instances.
$$TPR_1 = \frac{TP}{TP + FN} = \frac{40}{100} = 0.4 \qquad FPR_1 = \frac{FP}{FP + TN} = \frac{60}{10000} = 0.006$$
Now consider a second method that selects 680 documents, of which 80 are correct, and imagine that our test set is now composed of 100 True instances and 100000 Negative instances.
$$TPR_2 = \frac{TP}{TP + FN} = \frac{80}{100} = 0.8 \qquad FPR_2 = \frac{FP}{FP + TN} = \frac{600}{100000} = 0.006$$
Which method is better?
Precision recall curves
The second one may seem better, because it retrieves more relevant documents, but the proportion of predictions that are correct (precision or PPV) is smaller:
$$PPV = \frac{TP}{TP + FP}$$
$$\mathrm{Precision}_1 = \frac{40}{100} = 0.40 \qquad \mathrm{Precision}_2 = \frac{80}{680} \approx 0.12$$
(Note: you can also use FDR = 1 - PPV)
Thus, one must also take into account the relative cost of the predictions, i.e. the FN and FP values that must be accepted to achieve a high TPR.
One can make TN arbitrarily large to make FPR → 0, so other accuracy measures are needed to have a more correct picture.
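The numbers of the document-classification example can be recomputed as follows. The function name and argument names are assumptions; the counts are those given in the example (documents selected, documents correct, and the numbers of true and negative instances in each test set).

```python
def document_rates(selected, correct, positives, negatives):
    """TPR, FPR and PPV for a method that selects documents from a test set."""
    tp = correct
    fp = selected - correct
    fn = positives - tp
    tn = negatives - fp
    return {"TPR": tp / (tp + fn), "FPR": fp / (fp + tn), "PPV": tp / (tp + fp)}

method_1 = document_rates(selected=100, correct=40, positives=100, negatives=10000)
method_2 = document_rates(selected=680, correct=80, positives=100, negatives=100000)
for name, r in (("method 1", method_1), ("method 2", method_2)):
    print(name, {k: round(v, 3) for k, v in r.items()})
```

Both methods have the same FPR, yet the precision of the second one is much lower, which is the point the example makes.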
Precision recall curves
Precision = proportion of the predictions that are correct
$$\mathrm{precision} = PPV = \frac{TP}{TP + FP}$$
Recall = proportion of the true instances that are correctly recovered
$$\mathrm{recall} = TPR = \frac{TP}{TP + FN}$$
(see e.g. Plass et al. RNA 2012)
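The points of a precision-recall curve can be obtained by walking down a ranking of scores, in the same way as for a ROC curve. The sketch below reuses the ranking from the earlier ROC example and assumes no tied scores; the function name is an assumption.

```python
from fractions import Fraction

# Scores and known labels from the ROC example (R = real, F = false).
scores = [10, 7, 4, 2, 1, -0.4, -2, -5, -9]
labels = ["R", "R", "R", "F", "R", "R", "F", "F", "F"]

def precision_recall_points(scores, labels, positive="R"):
    """Return (recall, precision) after predicting the top k cases as positive, k = 1..n."""
    n_pos = labels.count(positive)
    ranked = sorted(zip(scores, labels), reverse=True)
    points = []
    tp = 0
    for k, (_, label) in enumerate(ranked, start=1):
        tp += label == positive                 # true positives among the top k
        points.append((Fraction(tp, n_pos), Fraction(tp, k)))
    return points

pr_points = precision_recall_points(scores, labels)
for recall, precision in pr_points:
    print(recall, precision)
```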
Precision recall curves
Model 1: has a greater AUC, but low precision (high cost of false positives).
Model 2: achieves a lower AUC than Model 1, but still quite good. Precision is greatly improved.
References
Data Mining: Practical Machine Learning Tools and Techniques. Ian H. Witten, Eibe Frank, Mark A. Hall. Morgan Kaufmann. ISBN 978-0-12-374856-0. http://www.cs.waikato.ac.nz/ml/weka/book.html
Methods for Computational Gene Prediction. W.H. Majoros. Cambridge University Press, 2007.