Classifier Evaluation: Learning Curves, Error Estimation, Confusion Matrices, Reject and ROC Analysis


Classifier Learning Curve (cleval, testc)

How to estimate classifier performance: learning curves, feature curves, rejects and ROC curves. The learning curve shows the true classification error ε as a function of the training set size; for a Bayes-consistent classifier it approaches the Bayes error ε*, while a sub-optimal classifier levels off above it.

The Apparent Classification Error

The apparent (or resubstitution) error, measured on the training set itself, is positively biased (optimistic). The bias between the true error ε and the apparent error of the training set shrinks as the training set grows.

Error Estimation by Test Set (gendat)

Split the design set: train the classifier on a training set and estimate the classification error on a separate test set. An independent test set is needed! Another training set gives another classifier; another test set gives another error estimate.

(Table: classification error estimates for various training set sizes and test set sizes N.)

The training set should be large to obtain a good classifier. The test set should be large for a reliable, unbiased error estimate. In practice, often just a single design set is given.

Cross-validation (gendat, crossval)

Rotate n times: split the design set, train a classifier on one part, test it on the other, and average the resulting classification error estimates.

Using the same set for training and testing may give a good classifier, but will definitely yield an optimistically biased error estimate. A large, independent test set yields an unbiased and reliable (small variance) error estimate for a badly trained classifier. A small, independent test set yields an unbiased but unreliable (large variance) error estimate for a well trained classifier. A 50-50 split between training and testing is an often used standard solution. Is it really good? There are better alternatives.

In n-fold cross-validation the test set is 1/n of the design set and the training set is (n-1)/n of the design set. Train and test n times and average the errors (a good choice is n = 10). All objects are tested once, which gives the most reliable test result that is possible. The final classifier is trained on all objects, the best possible classifier; its cross-validation error estimate is slightly pessimistically biased.
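A minimal sketch of these three estimates, assuming Python with scikit-learn and a synthetic dataset instead of the PRTools routines (gendat, testc, crossval) used on the slides; all names and numbers below are illustrative only:

# Sketch: apparent vs. hold-out vs. 10-fold cross-validation error (illustrative).
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Hold-out: an independent test set gives an unbiased error estimate.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
clf = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
apparent_error = 1 - clf.score(X_tr, y_tr)   # optimistically biased (resubstitution)
test_error     = 1 - clf.score(X_te, y_te)   # unbiased; variance depends on test size

# 10-fold cross-validation on the whole design set.
cv_error = 1 - cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=10).mean()

print(f"apparent {apparent_error:.3f}  test {test_error:.3f}  10-fold CV {cv_error:.3f}")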

Leave-one-out Procedure (crossval, testk)

Leave-one-out is cross-validation in which n is the total number of objects: rotate n times, each time training a classifier on n-1 objects and testing the single object that was left out. n classifiers have to be computed, which is in general unfeasible for large n, but doable for the k-NN classifier (it needs no training).

Expected Learning Curves by Estimated Errors

As a function of the training set size, the leave-one-out (n-1) error estimate and the cross-validation error estimate (rotation method) lie close to the true error ε, while the apparent error of the training set lies below it.

Averaged Learning Curve / Repeated Learning Curves

For obtaining the theoretically expected curves, many repetitions are needed; small sample sizes have a very large variability.

a = gendath([ ]);
e = cleval(a, ldc, [2,3,5,7,10,15,20], 5);
plote(e);

a = gendath([ ]);
for j = 1:10
  e = cleval(a, ldc, [2,3,5,7,10,15,20], 1);
  hold on; plote(e);
end

Learning Curves for Different Classifier Complexity

More complex classifiers are better in case of large training sets and worse in case of small training sets; the learning curves approach the Bayes error from above.

Peaking Phenomenon, Overtraining / Curse of Dimensionality, Rao's Paradox

As a function of the feature set size (dimensionality) or the classifier complexity, the error first decreases and then increases again.
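A corresponding sketch of the leave-one-out estimate for a k-NN classifier, again assuming scikit-learn and synthetic data rather than the PRTools calls (crossval, testk) shown on the slides:

# Sketch: leave-one-out error estimate for a k-NN classifier (illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)

# One classifier per object: train on n-1 objects, test the single object left out.
loo_scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
print("leave-one-out error estimate:", 1 - loo_scores.mean())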

Example: Overtraining, Polynomial Classifier (svc([],'p',d), gendat)

A polynomial support vector classifier svc([],'p',d) is trained on a small dataset for increasing degree d (complexity 1 to 6). With growing complexity the decision boundary follows the training objects ever more closely.

Example: Overtraining (2)-(4) (parzenc([],h))

Run the smoothing parameter h in the Parzen classifier from small to large: a small h gives a complex, overtrained decision boundary, a large h a smooth, simple one.

Overtraining ↔ Increasing Bias

Example: Curse of Dimensionality

Fisher classifier for various feature rankings: as a function of the feature set size (dimensionality) / classifier complexity, the true error first decreases and then increases again, while the apparent error keeps decreasing; the difference between them is the bias.
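A sketch of the overtraining effect under assumed choices: scikit-learn, with logistic regression on polynomial features standing in for the slides' polynomial support vector classifier; data, degrees and sample sizes are illustrative only:

# Sketch: overtraining with increasing polynomial degree (illustrative).
# The apparent (training) error keeps dropping with complexity; the test error rises again.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(n_samples=60, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=2)

for degree in range(1, 7):
    clf = make_pipeline(PolynomialFeatures(degree),
                        LogisticRegression(max_iter=5000)).fit(X_tr, y_tr)
    print(degree,
          "apparent error %.2f" % (1 - clf.score(X_tr, y_tr)),
          "test error %.2f" % (1 - clf.score(X_te, y_te)))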

Confusion Matrix (1)-(3) (confmat)

The confusion matrix lists, for the objects of each real class (rows: class A, B, C), how many obtain each of the possible labels (columns): the diagonal holds the correctly classified objects, the off-diagonal entries the confusions. From it the error in each class and the averaged error are computed.

(Tables: 3x3 confusion matrices for classes A, B and C, with the per-class errors and the averaged error.)

Classification details are only observable in the confusion matrix!

Conclusions on Error Estimation

- Larger training sets yield better classifiers.
- Independent test sets are needed for obtaining unbiased error estimates.
- Larger test sets yield more accurate error estimates.
- Leave-one-out cross-validation seems to be an optimal compromise, but might be computationally infeasible.
- 10-fold cross-validation is a good practice.
- More complex classifiers need larger training sets to avoid overtraining. This holds in particular for larger feature set sizes, due to the curse of dimensionality.
- For too small training sets, simpler classifiers or smaller feature sets are needed.
- Confusion matrices allow a detailed look at the classification per class.

Reject and ROC Analysis

Outlier reject, reject types, reject curves, performance measures, varying costs and priors, ROC analysis.

Outlier Reject (rejectc)

Reject objects for which the probability density is low. Note: in these areas the posterior probabilities might still be high! Rejecting a fraction r of the objects reduces the error from ε to ε_r.
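A small sketch of a confusion matrix with per-class errors, assuming scikit-learn's confusion_matrix in place of PRTools' confmat; the label vectors below are made up for illustration:

# Sketch: confusion matrix and per-class errors for three classes A, B, C (illustrative).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = list("AAAAAABBBBBBCCCCCC")   # hypothetical real labels
y_pred = list("AAAAABBBBBCBCCCCCA")   # hypothetical obtained labels

cm = confusion_matrix(y_true, y_pred, labels=["A", "B", "C"])
print(cm)                              # rows: real labels, columns: obtained labels

per_class_error = 1 - np.diag(cm) / cm.sum(axis=1)
print("per-class errors:", per_class_error)
print("averaged error:  ", per_class_error.mean())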

Ambiguity Reject (rejectc)

Reject objects for which the classification is unsure, i.e. for which the posterior probabilities are about equal: reject everything within a distance d_r of the decision boundary.

Reject Curve

The classification error ε can be reduced to ε_r by rejecting a fraction r of the objects.

Rejecting Classifier / Reject Fraction

For reasonably well trained classifiers r > ε(r=0) - ε(r): to decrease the error by 1%, more than 1% of the objects has to be rejected.

How much to reject? (reject)

Compare the cost of a rejected object, c_r, with the cost of a classification error, c_ε. For a given total cost c = c_ε ε + c_r r this is a linear function (a straight line) in the (r, ε) space; shift it until a possible operating point on the reject curve is reached.

Performance measures for two classes: objects of class A, with prior p(A), are split by the classifier S(x) = 0 into a part η_A that is assigned to A and a part ε_A that is assigned to B, so p(A) = η_A + ε_A.
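A sketch of an ambiguity-reject rule that thresholds the maximum posterior probability, with assumed scikit-learn components rather than the slides' rejectc; the thresholds and data are illustrative:

# Sketch: ambiguity reject by thresholding the maximum posterior probability (illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.15, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=3)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

post = clf.predict_proba(X_te)          # posterior probabilities
pred = clf.predict(X_te)
conf = post.max(axis=1)                 # confidence = maximum posterior

for t in (0.5, 0.6, 0.7, 0.8, 0.9):
    accept = conf >= t                  # reject objects with about equal posteriors
    r = 1 - accept.mean()               # rejected fraction
    err = (pred[accept] != y_te[accept]).mean() if accept.any() else 0.0
    print(f"threshold {t:.1f}: reject fraction {r:.2f}, error on accepted {err:.2f}")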

Similarly, objects of class B, with prior p(B), are split by S(x) = 0 into a part η_B assigned to B and a part ε_B assigned to A, so p(B) = η_B + ε_B. With p(A) + p(B) = 1 this gives

ε = ε_A + ε_B, η = η_A + η_B, ε + η = 1.

- ε_A / p(A): error in class A; ε_A: error contribution of class A
- ε_B / p(B): error in class B; ε_B: error contribution of class B
- η_A / p(A): sensitivity (recall) of class A (what goes correct in class A)
- η_A / (η_A + ε_B): precision of class A (what belongs to A among everything that is classified as A)
- ε: classification error; η: performance

Given a trained classifier and a test set, with a target class T and the non-targets N, the counts are:

                 classified as T   classified as N
  true label T         TP                FN
  true label N         FP                TN

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Error = (FN + FP) / (TP + FN + FP + TN)
False Discovery Rate (FDR) = FP / (TP + FP)

Error: the probability of erroneous classifications. Performance: 1 - error. Sensitivity of a target class (e.g. diseased patients): the performance for objects from that target class. Specificity: the performance for all objects outside the target class. Precision of a target class: the fraction of correct objects among all objects assigned to that class. Recall: the fraction of correctly classified objects; this is identical to the performance, and identical to the sensitivity when related to a particular class.

ROC Curve (roc)

For every classifier S(x) - d = 0 two errors, ε_A and ε_B, are made; the ROC curve shows them as a function of the shift d of the decision boundary. The ROC curve is independent of the class prior probabilities (e.g. priors [0.5 0.5] versus [0.2 0.8]); a change of the priors is analogous to a shift of the decision boundary.
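A sketch that computes these quantities and an ROC curve on a synthetic 2-class test set, assuming scikit-learn (confusion_matrix, roc_curve) rather than the PRTools roc routine:

# Sketch: sensitivity, specificity and an ROC curve for a 2-class test set (illustrative).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=4)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
print("sensitivity", tp / (tp + fn))
print("specificity", tn / (tn + fp))
print("error      ", (fn + fp) / (tp + fn + fp + tn))

# ROC: sweep the decision threshold d over the classifier scores.
fpr, tpr, thresholds = roc_curve(y_te, clf.predict_proba(X_te)[:, 1])
print("first ROC operating points:", list(zip(fpr.round(2), tpr.round(2)))[:5])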

ROC Analysis (roc)

ROC: Receiver Operating Characteristic (from communication theory). For 2-class pattern recognition the ROC curve plots the error in class A against the error in class B, or equivalently the performance on the target class against the error on the non-targets, over all operating points, i.e. all values of d in S(x) - d = 0.

When are ROC Curves Useful? (1)-(3)

- Trade-off between cost and performance: what is the performance for a given cost, and what are the costs of a preset performance? (Medical diagnostics, database retrieval.)
- Study of the effect of changing priors.
- Comparing the behaviour of different classifiers S_1(x), S_2(x), S_3(x), and combining classifiers: any 'virtual' classifier between S_1(x) and S_2(x) in the (ε_A, ε_B) space can be realized by using, at random, S_1(x) a fraction α of the time and S_2(x) a fraction (1 - α) of the time (see the sketch below):

  ε_A = α ε_A1 + (1 - α) ε_A2
  ε_B = α ε_B1 + (1 - α) ε_B2

Summary Reject and ROC

- Reject for solving ambiguity: reject objects close to the decision boundary to lower the costs.
- Reject option for protection against outliers.
- ROC analysis for the performance-cost trade-off.
- ROC analysis in case of unknown or varying priors.
- ROC analysis for comparing / combining classifiers.
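A minimal sketch of such a randomly mixed 'virtual' classifier, under assumed choices (scikit-learn, two arbitrary base classifiers S1 and S2, synthetic data); the interpolation of the two error pairs holds in expectation:

# Sketch: a 'virtual' classifier that uses S1 with probability alpha and S2 otherwise,
# interpolating their two ROC operating points (illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.2, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=5)

S1 = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
S2 = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
p1, p2 = S1.predict(X_te), S2.predict(X_te)

rng = np.random.default_rng(0)
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    use_s1 = rng.random(len(y_te)) < alpha      # pick S1 with probability alpha
    pred = np.where(use_s1, p1, p2)
    err_a = (pred[y_te == 0] != 0).mean()       # error on class A (label 0)
    err_b = (pred[y_te == 1] != 1).mean()       # error on class B (label 1)
    print(f"alpha {alpha:.2f}: error_A {err_a:.2f}, error_B {err_b:.2f}")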