Evaluating Classifiers Lecture 2 Instructor: Max Welling

Evaluation of Results
How do you report classification error?
How certain are you about the error you claim?
How do you compare two algorithms?
How certain are you if you state one algorithm performs better than another?

Evaluation
Given:
A hypothesis h(x): X → C in hypothesis space H, mapping attributes x to classes c ∈ {1, 2, ..., C}.
A data sample S(n) of size n.
Questions:
What is the error of h on unseen data?
If we have two competing hypotheses, which one is better on unseen data?
How do we compare two learning algorithms in the face of limited data?
How certain are we about our answers?

Sample and True Error
We can define two errors:
1) error(h|S) is the error on the sample S:
   $\mathrm{error}(h \mid S) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[\, h(x_i) \neq y_i \,]$
2) error(h|P) is the true error on unseen data sampled from the distribution P(x):
   $\mathrm{error}(h \mid P) = \int dx \, P(x) \, \mathbf{1}[\, h(x) \neq f(x) \,]$
where f(x) is the true hypothesis and $\mathbf{1}[\cdot]$ denotes the indicator function (1 if its argument is true, 0 otherwise).
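As a minimal illustration (not part of the original slides), the sample error can be computed directly from predictions; the function and array names below are my own:

```python
import numpy as np

def sample_error(y_true, y_pred):
    """Empirical error error(h|S): fraction of sample points the hypothesis gets wrong."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean(y_true != y_pred)

# Hypothetical example: 2 mistakes out of 5 samples -> error 0.4
print(sample_error([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))
```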

Binomial Distributions
Assume you toss a coin n times, and that it has probability p of coming up heads (which we will call success). What is the probability distribution governing the number of heads in n trials? Answer: the Binomial distribution:
   $p(\#\mathrm{heads} = r \mid p, n) = \frac{n!}{r!(n-r)!} \, p^{r} (1-p)^{n-r}$
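A small sketch, assuming standard Python, that evaluates this probability mass function; the particular values of n, p, and r are only illustrative:

```python
from math import comb

def binom_pmf(r, n, p):
    """P(# heads = r | p, n) = C(n, r) * p^r * (1-p)^(n-r)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Illustrative values: probability of exactly 3 heads in 10 tosses of a fair coin
print(binom_pmf(3, 10, 0.5))  # ~0.117
```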

Distribution over Errors
Consider some hypothesis h(x). Draw a sample set X_j of n points from P(x); do this k times to obtain X_1, ..., X_k. Compute e_1 = n·error(h|X_1), e_2 = n·error(h|X_2), ..., e_k = n·error(h|X_k). Then {e_1, ..., e_k} are samples from a Binomial distribution! Why? Imagine a magic coin, where God secretly determines the probability of heads by the following procedure. First He takes some random hypothesis h. Then He draws x ~ P(x) and observes whether h(x) predicts the label correctly. If it does, He makes sure the coin lands heads up... You have a single sample S, for which you observe e(S) errors. What do you think would be a reasonable estimate for error(h|P)?
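A hedged simulation sketch of this argument (the "true" error p_true and all names are assumptions, not from the lecture): each point of a sample set is misclassified independently with probability p, so the error counts across repeated sample sets follow a Binomial distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 5000          # sample-set size and number of repeated sample sets
p_true = 0.15             # assumed true error rate error(h|P)

# Each drawn point is misclassified with probability p_true,
# so e_j = n * error(h|X_j) is Binomial(n, p_true).
mistakes = rng.random((k, n)) < p_true
error_counts = mistakes.sum(axis=1)

print("mean of e_j:", error_counts.mean(), " vs n*p =", n * p_true)
print("var  of e_j:", error_counts.var(), " vs n*p*(1-p) =", n * p_true * (1 - p_true))
```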

Binomial Moments
For a Binomial random variable r:
   $\mathrm{mean}(r) = E[r \mid n, p] = np$
   $\mathrm{var}(r) = E[(r - E[r])^2] = np(1-p)$
If we match the mean, np, with the observed value n·error(h|S) we find:
   $E[\mathrm{error}(h \mid P)] = E[r/n] = p \approx \mathrm{error}(h \mid S)$
If we match the variance we can obtain an estimate of the width:
   $\mathrm{var}[\mathrm{error}(h \mid P)] = \mathrm{var}[r/n] \approx \frac{\mathrm{error}(h \mid S)\,(1 - \mathrm{error}(h \mid S))}{n}$
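A minimal sketch of these plug-in estimates, with variable names of my own choosing:

```python
import numpy as np

def error_estimate(y_true, y_pred):
    """Plug-in estimates: mean = error(h|S), var = error(h|S)*(1 - error(h|S)) / n."""
    err = np.mean(np.asarray(y_true) != np.asarray(y_pred))
    n = len(y_true)
    var = err * (1 - err) / n
    return err, var

# Hypothetical test set of size n = 200 with 30 mistakes
err, var = error_estimate(np.zeros(200), np.r_[np.ones(30), np.zeros(170)])
print(err, var)   # 0.15 and 0.15*0.85/200
```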

Confidence Intervals
We would like to state: with N% confidence we believe that error(h|P) is contained in the interval
   $\mathrm{error}(h \mid P) \in \mathrm{error}(h \mid S) \pm z_N \sqrt{\frac{\mathrm{error}(h \mid S)\,(1 - \mathrm{error}(h \mid S))}{n}}$
For example, at 80% confidence $z_{0.8} = 1.28$: the interval extends 1.28 standard deviations on either side under a Normal(0,1) approximation.
In principle z_N is hard to compute exactly, but for np(1-p) > 5 or n > 30 it is safe to approximate the Binomial by a Gaussian, for which we can easily compute z-values.
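A hedged helper computing such an interval via the Gaussian approximation, using scipy's normal quantile function (the 80% level and the counts are illustrative):

```python
from math import sqrt
from scipy.stats import norm

def error_confidence_interval(err, n, confidence=0.80):
    """Normal-approximation interval: err +/- z_N * sqrt(err*(1-err)/n)."""
    z = norm.ppf(0.5 + confidence / 2)   # two-sided z-value, e.g. ~1.28 for 80%
    half_width = z * sqrt(err * (1 - err) / n)
    return err - half_width, err + half_width

# Illustrative: 15% sample error on n = 200 test points
print(error_confidence_interval(0.15, 200, confidence=0.80))
```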

Bias-Variance
The estimator is unbiased if $E[\mathrm{error}(h \mid X)] = p$.
Imagine again that you have infinitely many sample sets X_1, X_2, ... of size n. Use these to compute estimates E_1, E_2, ... of p, where E_i = error(h|X_i). If the average of E_1, E_2, ... converges to p, then error(h|X) is an unbiased estimator.
Two unbiased estimators can still differ in their variance (efficiency). Which one do you prefer?
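A simulation sketch (all numbers are assumptions) illustrating unbiasedness: the average of the estimates E_i approaches p, while their spread matches p(1-p)/n.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, num_sets = 0.2, 50, 20000   # assumed true error, sample-set size, number of sample sets

# E_i = error(h|X_i): fraction of misclassified points in each sample set
estimates = (rng.random((num_sets, n)) < p).mean(axis=1)

print("average of E_i:", estimates.mean(), "(should approach p =", p, ")")
print("variance of E_i:", estimates.var(), "(compare p*(1-p)/n =", p * (1 - p) / n, ")")
```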

Flow of Thought
Determine the property you want to know about the future data (e.g. error(h|P)).
Find an unbiased estimator E for this quantity based on observing data X (e.g. error(h|X)).
Determine the distribution P(E) of E under the assumption that you have infinitely many sample sets X_1, X_2, ... of some size n (e.g. P(E) = Binomial(p, n) with p = error(h|P)).
Estimate the parameters of P(E) from an actual data sample S (e.g. p ≈ error(h|S)).
Compute the mean and variance of P(E) and pray that P(E) is close to a Normal distribution (sums of random variables converge to normal distributions: the central limit theorem).
State your confidence interval as: with confidence N%, error(h|P) is contained in the interval $\mathrm{mean} \pm z_N \sqrt{\mathrm{var}}$.

Assumptions
We only consider discrete-valued hypotheses (i.e. classes).
Training data and test data are drawn IID from the same distribution P(x). (IID: independently and identically distributed.)
The hypothesis must be chosen independently from the data sample S!
When you obtain a hypothesis from a learning algorithm, split the data into a training set and a testing set, as sketched below. Find the hypothesis using the training set and estimate its error on the testing set.
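A minimal sketch of this train/test protocol, assuming scikit-learn is available; the synthetic data and the choice of classifier are placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: X is a feature matrix, y holds binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Choose the hypothesis on the training set only, estimate its error on the held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
h = DecisionTreeClassifier().fit(X_train, y_train)
test_error = np.mean(h.predict(X_test) != y_test)
print("estimated error(h|S_test):", test_error)
```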

Comparing Hypotheses
Assume we would like to compare two hypotheses h_1 and h_2, which we have tested on two independent samples S_1 and S_2 of size n_1 and n_2. I.e. we are interested in the quantity
   $d = \mathrm{error}(h_1 \mid P) - \mathrm{error}(h_2 \mid P)$
Define an estimator for d:
   $\hat{d} = \mathrm{error}(h_1 \mid X_1) - \mathrm{error}(h_2 \mid X_2)$
with X_1, X_2 sample sets of size n_1, n_2. Since error(h_1|S_1) and error(h_2|S_2) are both approximately Normal, their difference is approximately Normal with
   $\mathrm{mean} = \mathrm{error}(h_1 \mid S_1) - \mathrm{error}(h_2 \mid S_2)$
   $\mathrm{var} = \frac{\mathrm{error}(h_1 \mid S_1)\,(1 - \mathrm{error}(h_1 \mid S_1))}{n_1} + \frac{\mathrm{error}(h_2 \mid S_2)\,(1 - \mathrm{error}(h_2 \mid S_2))}{n_2}$
Hence, with N% confidence we believe that d is contained in the interval
   $d \in \mathrm{mean} \pm z_N \sqrt{\mathrm{var}}$
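A hedged helper implementing this interval for the difference of two error rates (function and argument names are mine):

```python
from math import sqrt
from scipy.stats import norm

def error_difference_interval(err1, n1, err2, n2, confidence=0.95):
    """Normal-approximation interval for d = error(h1|P) - error(h2|P)."""
    mean = err1 - err2
    var = err1 * (1 - err1) / n1 + err2 * (1 - err2) / n2
    z = norm.ppf(0.5 + confidence / 2)
    return mean - z * sqrt(var), mean + z * sqrt(var)

# Illustrative numbers: h1 errs 10% on 500 points, h2 errs 13% on 400 points
print(error_difference_interval(0.10, 500, 0.13, 400))
```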

Paired Tests
Consider the following data:
error(h1|s1) = 0.10   error(h2|s1) = 0.11
error(h1|s2) = 0.20   error(h2|s2) = 0.21
error(h1|s3) = 0.66   error(h2|s3) = 0.67
error(h1|s4) = 0.45   error(h2|s4) = 0.46
and so on. We have var(error(h1)) = large and var(error(h2)) = large, and the total variance of error(h1) - error(h2) is their sum. However, h1 is consistently better than h2: we ignored the fact that we compare on the same data. We want a different estimator that compares the data sample by sample, as illustrated below. You can use a paired t-test (e.g. in MATLAB) to see if the two errors are significantly different, or if one error is significantly larger than the other.
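As a hedged illustration in Python (scipy rather than MATLAB), a paired t-test on error pairs in the spirit of the table above; the numbers are slightly perturbed so the per-sample differences are not all identical:

```python
from scipy.stats import ttest_rel

# Hypothetical per-sample errors of the two hypotheses on the same samples
err_h1 = [0.10, 0.20, 0.66, 0.45]
err_h2 = [0.12, 0.21, 0.68, 0.46]

# Paired t-test: tests whether the mean of the per-sample differences is zero
t_stat, p_value = ttest_rel(err_h1, err_h2)
print(t_stat, p_value)
```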

Paired t-test
Chunk the data up into subsets T_1, ..., T_k with |T_i| > 30. On each subset compute the error difference
   $\delta_i = \mathrm{error}(h_1 \mid T_i) - \mathrm{error}(h_2 \mid T_i)$
Now compute:
   $\bar{\delta} = \frac{1}{k} \sum_{i=1}^{k} \delta_i$
   $s_{\bar{\delta}} = \sqrt{\frac{1}{k(k-1)} \sum_{i=1}^{k} (\delta_i - \bar{\delta})^2}$
State: with N% confidence the difference in error between h_1 and h_2 is
   $\bar{\delta} \pm t_{N,k-1} \, s_{\bar{\delta}}$
where $t_{N,k-1}$ is the t-statistic, which is related to the Student-t distribution (Table 5.6).
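A minimal sketch of these formulas; the per-subset differences are placeholders, and scipy's t quantile supplies $t_{N,k-1}$:

```python
import numpy as np
from scipy.stats import t

def paired_t_interval(deltas, confidence=0.95):
    """Interval  delta_bar +/- t_{N,k-1} * s_delta  for the mean error difference."""
    deltas = np.asarray(deltas, dtype=float)
    k = len(deltas)
    delta_bar = deltas.mean()
    s_delta = np.sqrt(np.sum((deltas - delta_bar) ** 2) / (k * (k - 1)))
    t_val = t.ppf(0.5 + confidence / 2, df=k - 1)
    return delta_bar - t_val * s_delta, delta_bar + t_val * s_delta

# Placeholder per-subset differences error(h1|Ti) - error(h2|Ti)
print(paired_t_interval([-0.02, -0.01, -0.015, -0.02, -0.01]))
```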

Comparing Learning Algorithms
In general it is a really bad idea to estimate error rates on the same data on which a learning algorithm is trained. WHY?
So, just as in cross-validation, we split the data into k subsets: S = {T_1, T_2, ..., T_k}. Train both learning algorithm 1 (L_1) and learning algorithm 2 (L_2) on the complement of each subset, {S - T_1, S - T_2, ...}, to produce hypotheses {L_1(S - T_i), L_2(S - T_i)} for all i. Compute for all i:
   $\delta_i = \mathrm{error}(L_1(S - T_i) \mid T_i) - \mathrm{error}(L_2(S - T_i) \mid T_i)$
Note: we train on S - T_i, but test on T_i. As in the last slide, perform a paired t-test on these differences to compute an estimate and a confidence interval for the relative error of the hypotheses produced by L_1 and L_2.
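A hedged end-to-end sketch of this procedure, assuming scikit-learn; the two learning algorithms and the synthetic dataset are placeholders:

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Placeholder dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
y = (X[:, :2].sum(axis=1) + 0.5 * rng.normal(size=600) > 0).astype(int)

errs1, errs2 = [], []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    # Train both algorithms on S - Ti, test the resulting hypotheses on Ti
    h1 = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    h2 = GaussianNB().fit(X[train_idx], y[train_idx])
    errs1.append(np.mean(h1.predict(X[test_idx]) != y[test_idx]))
    errs2.append(np.mean(h2.predict(X[test_idx]) != y[test_idx]))

# Paired t-test on the per-fold error differences
t_stat, p_value = ttest_rel(errs1, errs2)
print("mean per-fold difference:", np.mean(errs1) - np.mean(errs2))
print("paired t-test statistic and p-value:", t_stat, p_value)
```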

Evaluation: ROC curves
(Figure: the score distributions of class 0 (negatives) and class 1 (positives), with a moving decision threshold between them.)
TP = true positive rate = # positives classified as positive / # positives
FP = false positive rate = # negatives classified as positive / # negatives
TN = true negative rate = # negatives classified as negative / # negatives
FN = false negative rate = # positives classified as negative / # positives
Identify a threshold in your classifier that you can shift, and plot the ROC curve (TP rate against FP rate) while you shift that parameter.
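A minimal sketch of tracing an ROC curve by sweeping a threshold over classifier scores (the scores and labels are synthetic; sklearn.metrics.roc_curve computes the same quantities):

```python
import numpy as np

def roc_points(scores, labels, num_thresholds=100):
    """Sweep a threshold over classifier scores and record (FP rate, TP rate) pairs."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    thresholds = np.linspace(scores.min(), scores.max(), num_thresholds)
    fpr, tpr = [], []
    for thr in thresholds:
        pred = scores >= thr
        tpr.append(np.mean(pred[labels == 1]))   # TP rate: positives classified as positive
        fpr.append(np.mean(pred[labels == 0]))   # FP rate: negatives classified as positive
    return np.array(fpr), np.array(tpr)

# Synthetic scores: positives tend to score higher than negatives
rng = np.random.default_rng(0)
labels = np.r_[np.zeros(500, dtype=int), np.ones(500, dtype=int)]
scores = np.r_[rng.normal(0.0, 1.0, 500), rng.normal(1.5, 1.0, 500)]
fpr, tpr = roc_points(scores, labels)
print(fpr[:5], tpr[:5])
```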

Conclusion
Never (ever) draw error curves without confidence intervals. (The second most important sentence of this course.)