Lecture 10: Performance Evaluation of ML Methods


CSE517A Machine Learning, Spring 2018
Instructor: Marion Neumann
Reading: fcml: 5.4 (Performance); esl: 7.10 (Cross-Validation); optional book: Evaluating Learning Algorithms [1]
Scribe: Jingyu Xi

1 Introduction

Comparing different machine learning methods or model choices, such as different hyperparameters or kernels, is an important step in ML research and applications. Typically our goal is to find the best method. To do so, we need to be able to measure performance and compare those measures among different ML methods or parameter/model settings.

Question: What does the performance of an ML method depend on?
- choice of learning algorithm/model
- tuning of the model (parameters/hyper-parameters)
- training dataset D and test dataset
- performance measure
- statistical test

Question: What exactly is the purpose of evaluation?
(1) ML research: compare a new algorithm/method to others
    - on a specific domain
    - on a set of benchmark domains (general effectiveness)
(2) ML applications: compare multiple algorithms/methods
    - on a specific domain (find the best algorithm for a given application)
    - on a set of benchmark domains

Note on scientific experiments: Performance comparisons, and especially tests to assess statistical significance, only make sense if we use exactly the same cross-validation splits for all methods. Comparing aggregated measures can only give a vague hint about which method performs better; there is little to no scientific value in such comparisons!

2 Assessing Performance

2.1 Performance Measures for Regression

Error (with n the number of evaluated examples):
- Mean absolute error (MAE):
  $$ e = \frac{1}{n} \sum_{i=1}^{n} |f(x_i) - y_i| \tag{1} $$

[1] For further reading, especially for PhD students, I recommend this book: Evaluating Learning Algorithms: A Classification Perspective by Nathalie Japkowicz and Mohak Shah (Cambridge University Press, 2011).

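- Mean squared error (MSE):
  $$ e = \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2 \tag{2} $$

Coefficient of determination: The coefficient of determination measures how well the regression function approximates the observed data. Define
  $$ R^2 = 1 - \frac{SS_{RES}}{SS_{TOT}} \tag{3} $$
where, with all sums running over i = 1, ..., n and \bar{y} the mean of the observed targets,
  $$ SS_{RES} = \sum_{i} (f(x_i) - y_i)^2 \tag{4} $$
  $$ SS_{REG} = \sum_{i} (f(x_i) - \bar{y})^2 \tag{5} $$
  $$ SS_{TOT} = \sum_{i} (y_i - \bar{y})^2 \tag{6} $$
For linear regression, we have SS_RES + SS_REG = SS_TOT, so that R^2 = SS_REG / SS_TOT. Up to a normalizing factor, SS_RES is the variance of the residuals, SS_REG is the variance of the model's predictions, and SS_TOT is the sample variance.

Negative log predictive density (NLPD): The smaller the negative log predictive density, the better the model.
  $$ NLPD = -\log p(y) \tag{7} $$
  $$ \phantom{NLPD} = -\sum_{i} \log p(y_i \mid x_i) \tag{8} $$
For example, for a Gaussian process with predictive mean \mu_i and predictive variance \sigma_i^2 at x_i,
  $$ NLPD = \sum_{i} \left( \frac{1}{2} \log(2\pi\sigma_i^2) + \frac{(y_i - \mu_i)^2}{2\sigma_i^2} \right). $$

2.2 Performance Measures for Classification

0/1 loss:
  $$ \frac{1}{n} \sum_{i=1}^{n} \delta_{h(x_i) \neq y_i} \cdot 100\% \tag{9} $$
Accuracy:
  $$ \frac{1}{n} \sum_{i=1}^{n} \delta_{h(x_i) = y_i} \cdot 100\% \tag{10} $$
When accuracy is used as the measure, false positives and false negatives are treated equally.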
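The measures in Eqs. (1) to (10) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names and the toy numbers are assumptions made for this example and do not come from the lecture.

```python
import math

def mae(preds, targets):
    """Mean absolute error, Eq. (1)."""
    return sum(abs(f - y) for f, y in zip(preds, targets)) / len(targets)

def mse(preds, targets):
    """Mean squared error, Eq. (2)."""
    return sum((f - y) ** 2 for f, y in zip(preds, targets)) / len(targets)

def r_squared(preds, targets):
    """Coefficient of determination, R^2 = 1 - SS_RES / SS_TOT."""
    y_bar = sum(targets) / len(targets)
    ss_res = sum((f - y) ** 2 for f, y in zip(preds, targets))
    ss_tot = sum((y - y_bar) ** 2 for y in targets)
    return 1.0 - ss_res / ss_tot

def gaussian_nlpd(means, variances, targets):
    """NLPD for independent Gaussian predictive distributions."""
    return sum(0.5 * math.log(2 * math.pi * v) + (y - m) ** 2 / (2 * v)
               for m, v, y in zip(means, variances, targets))

def accuracy_percent(labels_pred, labels_true):
    """Accuracy in percent, Eq. (10); the 0/1 loss is 100 minus this value."""
    hits = sum(1 for h, y in zip(labels_pred, labels_true) if h == y)
    return 100.0 * hits / len(labels_true)

# Hypothetical toy data, chosen only to make the arithmetic easy to follow.
targets = [1.0, 2.0, 3.0, 4.0]
preds = [1.5, 2.0, 2.5, 4.0]
print(mae(preds, targets), mse(preds, targets), r_squared(preds, targets))
print(gaussian_nlpd(preds, [1.0] * 4, targets))
print(accuracy_percent([1, -1, 1, 1], [1, 1, 1, -1]))
```

Note that `r_squared` can become negative for a model that fits worse than always predicting the mean of the targets.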

Confusion Matrix:

                      true class
                      +       -
    prediction   +    TP      FP
                 -    FN      TN

    Table 1: Example of confusion matrix

Table 1 is an example of a confusion matrix. TP is the number of true positive examples, TN is the number of true negative examples, FP is the number of false positive examples and FN is the number of false negative examples. P = TP + FN (all positive examples in D) and N = TN + FP (all negative examples in D). Then we define:
- false positive rate: $FPR = \frac{FP}{TN + FP} = \frac{FP}{N}$
- false negative rate: $FNR = 1 - TPR = \frac{FN}{P}$
- sensitivity (recall, true positive rate): $TPR = \frac{TP}{TP + FN} = \frac{TP}{P}$
- specificity (true negative rate): $TNR = \frac{TN}{TN + FP} = \frac{TN}{N}$
- precision (positive predictive value): $PPV = \frac{TP}{TP + FP}$
- accuracy: $\frac{TP + TN}{P + N}$
- F-score (harmonic mean of precision and recall): $\frac{2}{\frac{1}{prec} + \frac{1}{rec}} = \frac{2 \cdot precision \cdot recall}{precision + recall}$

Area under the ROC (receiver operating characteristics) curve (AUC):
Predict y = +1 if $p(y = +1 \mid x, X, y) \geq t$ (usually t is set to 0.5). The ROC curve plots the true positive rate versus the false positive rate for various values of t.

Figure 1: ROC for logistic regression using three different loss functions.
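As a rough sketch of how these quantities fit together, the following plain-Python code derives the rates from the confusion-matrix counts and traces the ROC curve by sweeping the threshold t over the observed scores. The function names, the labels in {+1, -1}, the toy scores, and the trapezoidal rule for the area are all assumptions made for this illustration; the code also assumes that both classes are present and that at least one example is predicted positive.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for labels in {+1, -1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, h in pairs if y == 1 and h == 1)
    fp = sum(1 for y, h in pairs if y == -1 and h == 1)
    fn = sum(1 for y, h in pairs if y == 1 and h == -1)
    tn = sum(1 for y, h in pairs if y == -1 and h == -1)
    return tp, fp, fn, tn

def rates(tp, fp, fn, tn):
    """Measures derived from the confusion matrix."""
    p, n = tp + fn, tn + fp
    precision, recall = tp / (tp + fp), tp / p
    return {
        "FPR": fp / n, "FNR": fn / p, "TPR": recall, "TNR": tn / n,
        "PPV": precision, "accuracy": (tp + tn) / (p + n),
        "F": 2 * precision * recall / (precision + recall),
    }

def roc_auc(y_true, scores):
    """ROC points (FPR, TPR) for decreasing thresholds, and trapezoidal AUC."""
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= t else -1 for s in scores]
        tp, fp, fn, tn = confusion_counts(y_true, y_pred)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    auc = sum((x1 - x0) * (y0 + y1) / 2
              for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return points, auc

# Hypothetical scores p(y = +1 | x) for six examples.
y_true = [1, 1, -1, 1, -1, -1]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else -1 for s in scores]   # threshold t = 0.5
print(rates(*confusion_counts(y_true, y_pred)))
print(roc_auc(y_true, scores)[1])
```

Recomputing the confusion matrix for every threshold is quadratic in the number of examples; it is written this way only to mirror the definitions.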

Interpretation: $p(y = +1 \mid x, X, y)$ gives a ranking of our training data. AUC = probability that a randomly selected positive example is ranked higher than a randomly selected negative example. AUC = 1 for a perfect model. In the examples in Figure 1, the squared loss has a slightly higher AUC value than logistic regression using the other two losses.

Some properties of AUC:
- takes class imbalance into account
- only for binary classification (not for regression or multi-class)
- ROC curves may cross
- the thresholds mean different things for different models

Other measures for classification:
- RMSE (for probabilistic classifiers)
- information score
- cost curves

Summary Performance Measures:
- accuracy is not appropriate if
    - there is high class imbalance
    - FP and FN are not equally important
- the confusion matrix provides all information (can be extended to multi-class classification)
- use RMSE for regression

3 Statistical Tests

Question: Can we attribute the performance to the classifier/model or is it due to chance?

Goal: figure out whether the evaluation results are representative of the general behavior of the classifier. Use hypothesis testing to compare two classifiers A and B.

Null hypothesis H_0: A and B perform equivalently.
Aim: reject this null hypothesis.
Approach:
1. choose an appropriate test
2. compute the test statistic
3. if it lies in the critical region, then reject H_0

Note: hypothesis testing is not a proof! It just gives some evidence. Do not overvalue the result of statistical tests. It is always possible to show that two models are significantly different (even if the difference is really small). We just have to run enough experiments or have enough data. Choosing the right test is as important as choosing the right performance measure.

3.1 Parametric Tests

- make strong assumptions about the distribution of the underlying data
- usually it's tricky to verify that all assumptions hold
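The earlier warning that enough experiments will make even a tiny difference significant can be made concrete with the statistic of a paired t-test, t = d̄ / (σ_d / √n), which grows like √n when the mean difference d̄ and the spread σ_d stay fixed. The sketch below is only an illustration: the function name and the numbers (a mean accuracy difference of 0.001 with standard deviation 0.01) are hypothetical. The value 2.262 is the two-tailed 5% critical value of the t-distribution with 9 degrees of freedom; critical values for more degrees of freedom are smaller, so comparing against 2.262 is a conservative check for n > 10.

```python
import math

def t_statistic(mean_diff, std_diff, n, mu0=0.0):
    """t = (mean_diff - mu0) / (std_diff / sqrt(n)) for n paired results."""
    return (mean_diff - mu0) / (std_diff / math.sqrt(n))

T_CRIT = 2.262  # two-tailed, alpha = 0.05, df = 9 (conservative for larger n)

# Hypothetical, practically negligible difference between two models.
mean_diff, std_diff = 0.001, 0.01
for n in (10, 100, 1000, 10000):
    t = t_statistic(mean_diff, std_diff, n)
    print(n, round(t, 3), "reject H0" if abs(t) >= T_CRIT else "cannot reject H0")
```

The same difference of 0.1 percentage points is not significant for 10 or 100 paired results but is flagged as significant for 1000 or more, which is why statistical significance should not be confused with practical relevance.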


Example: t-test for two matched samples

- Results for both models need to come from the same data with matching randomization and partitions.
- We need a paired test as the samples are not independent.
- We test whether the two samples come from the same population.

    classifier    A          B
    CV split 1    acc^A_1    acc^B_1
    split 2       acc^A_2    acc^B_2
    split 3       acc^A_3    acc^B_3
    ...           ...        ...

    Table 2: Accuracy of model A and B in n runs of cross validation with means µ_A and µ_B.

H_0: the means µ_A and µ_B are the same (both models have the same performance)

To test this null hypothesis we compute the following test statistic:
  $$ t = \frac{\bar{d} - \mu_0}{\sigma_d / \sqrt{n}}, \tag{11} $$
where µ_0 = 0 (otherwise you test whether the average of the differences is significantly different from µ_0),
  $$ \bar{d} = \mu_A - \mu_B \quad \text{and} \quad \sigma_d = \sqrt{\frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1}} $$
with $d_i = acc^A_i - acc^B_i$ and degrees of freedom df = n - 1.

The two-tailed version of the test, as illustrated in Fig. 2, can now be used to check whether we can reject H_0. Use the table in Figure 3 to get the critical values -t_α and t_α for a (user-chosen) significance level α. If the test statistic t as computed via Eq. (11) falls into the critical region, i.e. t ≤ -t_α or t ≥ t_α, we can reject H_0 at significance level α. Typically α = 0.05 or smaller.

Figure 2: Critical region containing α of the probability mass for a two-tailed test.

The one-tailed version can be used to test whether the mean of the accuracies of model A is larger than that of model B (or vice versa) [2].

[2] See this discussion of one vs. two tailed tests: faq-what-are-the-differences-between-one-tailed-and-two-tailed-tests/.

Assumptions:
(1) Normality: the samples come from a normally distributed population; alternatively, the sample size should be greater than 30

(2) Randomness of the samples: the samples need to represent the population; achieve this by randomly selecting test sets or splitting the data
(3) Equal variance of the populations. This is usually not true.

3.2 Non-parametric Tests

- make weaker assumptions
- less powerful [3] than parametric tests

Examples are the sign test, McNemar's test and the Wilcoxon signed-rank test.

Example: McNemar's test

Let c_01 be the number of instances misclassified by A and correctly classified by B, and c_10 be the number of instances misclassified by B and correctly classified by A. McNemar's test tests the following null hypothesis:

H_0: p(c_01) = p(c_10) (both models have the same performance)

The test statistic is given as:
  $$ \chi^2_{MC} = \frac{(|c_{01} - c_{10}| - 1)^2}{c_{01} + c_{10}}. \tag{12} $$
Requirement: c_01 + c_10 ≥ 20

χ²_MC has a Chi-squared distribution with 1 degree of freedom. Find an illustration of the critical region and the table in Figure 4. For example, if we fix α = 0.01, then χ²_{1,α} = 6.635. If we get χ²_MC = 7.1, then we can reject H_0 and assume that A (B) performs significantly better than B (A) with 99% confidence.

Example: Sign test

Define the following:
- A_win = number of experiments in which A outperforms B
- B_win = number of experiments in which B outperforms A

H_0: p(A_win > B_win) = 0.5 (both models have the same performance)

If H_0 is true, the number of wins follows a binomial distribution B(n, θ = 0.5), where n is the number of experiments. Use the two-tailed version of the test and get the critical value w_α from the table in Figure 5. Use the one-tailed version if you want to test whether the number of wins of A is larger than that of model B (or vice versa). A_win needs to be larger than w_α for A to be considered statistically significantly better at significance level α. If n > 25, we can use the normal approximation of the binomial distribution.

In practice, we often use 10 runs of 10-fold cross-validation (10 × 10-fold CV), where each 10-fold CV run is considered as one experiment and we compare the average results of these runs in a statistical test.
This gives more stable estimates of a method's performance than using the results of the cross-validation splits directly (those have a higher variance!). However, re-sampling is time consuming and it actually reuses the same data, which violates the independence assumption of the statistical tests.

Notes on Presenting Experimental Results

Presenting experimental results in a meaningful and convincing way is very challenging. Incorporating summary statistics and plots is extremely helpful for the reader (reviewers of a scientific publication, your boss, customers, etc.). Always take into account to whom you are presenting the results, and always explain all measures and plots very carefully.

[3] The power of a statistical significance test is defined as the probability that it will reject a false null hypothesis.

Appendix: Tables

Figure 3: t-distribution used in t-test.

Figure 4: Chi-squared distribution used in McNemar's test.

Figure 5: Binomial distribution used in sign-test.
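As an alternative to looking up the tables, the three tests from Section 3 can be sketched with the Python standard library alone. This is a hedged illustration rather than a reference implementation: the function names and the toy numbers are assumptions, the McNemar statistic is written in its continuity-corrected form (switchable via the hypothetical `correction` flag), and the sign test returns a two-sided p-value instead of a tabulated critical value. The constants 2.776 (t-distribution, df = 4, two-tailed α = 0.05) and 6.635 (Chi-squared, 1 degree of freedom, α = 0.01) are standard table values.

```python
import math

def paired_t(acc_a, acc_b, mu0=0.0):
    """Paired t statistic and degrees of freedom for matched CV results."""
    d = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(d)
    d_bar = sum(d) / n
    sigma_d = math.sqrt(sum((di - d_bar) ** 2 for di in d) / (n - 1))
    return (d_bar - mu0) / (sigma_d / math.sqrt(n)), n - 1

def mcnemar(c01, c10, correction=True):
    """McNemar statistic and its p-value under Chi-squared with 1 df."""
    diff = abs(c01 - c10) - (1 if correction else 0)
    stat = diff ** 2 / (c01 + c10)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

def sign_test_p(wins_a, wins_b):
    """Two-sided sign-test p-value under Binomial(n, 0.5); ties are dropped."""
    n, k = wins_a + wins_b, max(wins_a, wins_b)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical accuracies of models A and B on the same five CV splits.
acc_a = [0.82, 0.85, 0.80, 0.88, 0.84]
acc_b = [0.80, 0.82, 0.79, 0.85, 0.81]
t, df = paired_t(acc_a, acc_b)
print(t, df, abs(t) >= 2.776)          # compare with the df = 4 table value

stat, p = mcnemar(c01=30, c10=10)      # hypothetical disagreement counts
print(stat, p, stat >= 6.635)          # compare with the alpha = 0.01 value

print(sign_test_p(wins_a=9, wins_b=1)) # hypothetical win counts
```

For large n the exact binomial sum in `sign_test_p` can be replaced by the normal approximation mentioned in Section 3.2.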
