Performance Evaluation and Comparison


Hong Chang, Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012)

Outline
1. Introduction
2. Cross Validation and Resampling
3. Interval Estimation
4. Hypothesis Testing
5. Performance Evaluation
6. Performance Comparison

Introduction Two problems: Performance evaluation: assessing the expected error rate of a classification algorithm on a problem. Performance comparison: comparing the expected error rates of two or more classification algorithms to conclude which one is better. Any conclusion drawn is conditioned on the given data set only; no algorithm is the best for all possible data sets. Training data alone is not sufficient for performance evaluation and comparison: a validation set is needed, typically requiring multiple runs to average over different sources of randomness. Performance evaluation and comparison are then based on the distribution of the validation-set errors.

Performance Criteria We use error rate as the performance criterion here for performance evaluation and comparison, but there exist other performance criteria depending on the application. Examples of other criteria:
- Risks based on more general loss functions than 0-1 loss
- Training time and space complexity
- Testing time and space complexity
- Interpretability
- Easy programmability
- Cost-sensitive learning

K-fold Cross Validation The data set $X$ is randomly partitioned into $K$ equal-sized subsets $X_i$, $i = 1, \dots, K$, called folds. Stratification: the class distributions in the different subsets are kept roughly the same (stratified cross validation). This gives $K$ training/validation set pairs $\{(T_i, V_i)\}_{i=1}^{K}$:
$T_1 = X_2 \cup X_3 \cup \dots \cup X_K, \quad V_1 = X_1$
$T_2 = X_1 \cup X_3 \cup \dots \cup X_K, \quad V_2 = X_2$
$\vdots$
$T_K = X_1 \cup X_2 \cup \dots \cup X_{K-1}, \quad V_K = X_K$
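A minimal sketch of stratified K-fold cross validation, assuming scikit-learn, one of its built-in data sets, and a decision tree as a placeholder classifier:

```python
# Sketch of stratified K-fold cross validation (library, data set, and
# classifier are placeholder assumptions; any estimator with fit/predict works).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

fold_errors = []
for train_idx, val_idx in skf.split(X, y):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])            # train on T_i
    err = 1.0 - clf.score(X[val_idx], y[val_idx])  # error rate on V_i
    fold_errors.append(err)

print("per-fold validation errors:", np.round(fold_errors, 3))
```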

Leave-One-Out If $N$ is small, $K$ should be large to allow large enough training sets. One extreme case of $K$-fold cross validation is leave-one-out, where only one instance is left out as the validation set and the remaining $N - 1$ instances are used for training, giving $N$ separate training/validation set pairs.

Bootstrapping When the sample size $N$ is very small, a better alternative to cross validation is bootstrapping. Bootstrapping generates new samples, each of size $N$, by drawing instances randomly from the original sample with replacement. The bootstrap samples usually overlap more than the cross validation samples, and hence their estimates are more dependent. The probability that a given instance is not chosen after $N$ random draws is $(1 - \frac{1}{N})^N \approx e^{-1} \approx 0.368$, so each bootstrap sample contains only approximately 63.2% of the instances. Multiple bootstrap samples are used to maximize the chance that the system is trained on all the instances.
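A minimal bootstrap-sampling sketch in NumPy; the sample and the number of bootstrap replicates are illustrative, and the printed fraction empirically approaches $1 - e^{-1} \approx 0.632$:

```python
# Bootstrap-sampling sketch (data set and replicate count are illustrative).
import numpy as np

rng = np.random.default_rng(0)
N = 30
data = rng.normal(size=N)             # stand-in for the original sample of size N

B = 1000                              # number of bootstrap samples
unique_fractions = []
for _ in range(B):
    idx = rng.integers(0, N, size=N)  # draw N indices with replacement
    unique_fractions.append(len(np.unique(idx)) / N)

# Empirically close to 1 - e^{-1}, i.e. about 63.2% distinct instances per sample.
print("average fraction of distinct instances:", np.mean(unique_fractions))
```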

Error Measure If 0-1 loss is used, error calculations are based on the confusion matrix:

True class \ Predicted     Yes                     No
Yes                        TP (true positive)      FN (false negative)
No                         FP (false positive)     TN (true negative)

Error rate: $\text{Error rate} = \frac{FN + FP}{N}$, where $N = TP + FP + TN + FN$ is the total number of instances in the validation set. For more general loss functions, the risks should be measured instead. For $K > 2$ classes, the class confusion matrix is a $K \times K$ matrix whose $(i, j)$-th entry contains the number of instances that belong to $C_i$ but are assigned to $C_j$.
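A short sketch of computing these counts and the error rate directly from predicted and true labels (the labels below are made up for illustration):

```python
# Confusion-matrix counts and error rate from labels (illustrative values).
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))

N = tp + fn + fp + tn
error_rate = (fn + fp) / N
print(f"TP={tp} FN={fn} FP={fp} TN={tn}  error rate={error_rate:.3f}")
```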

ROC Curve Receiver operating characteristic (ROC) curve: hit rate $\frac{TP}{TP + FN}$ vs. false alarm rate $\frac{FP}{FP + TN}$. The curve is obtained by varying a certain parameter (e.g., the decision threshold) of the classification algorithm. The area under the curve (AUC) is often used as a performance measure.
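A minimal sketch of obtaining the ROC curve and AUC from classifier scores, assuming scikit-learn's metrics and toy labels/scores:

```python
# ROC curve and AUC from classifier scores (library and toy data are assumptions).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55])  # e.g. P(class = 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)  # sweeps the decision threshold
print("false alarm rates:", np.round(fpr, 2))
print("hit rates:        ", np.round(tpr, 2))
print("AUC:", roc_auc_score(y_true, scores))
```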

Point Estimation vs. Interval Estimation A point estimator (e.g., MLE) specifies a value for a parameter θ. An interval estimator specifies an interval within which θ lies with a certain degree of confidence. The probability distribution of the point estimator is used to obtain an interval estimator.

Example: Estimation of Mean of Normal Distribution Given an i.i.d. sample $X = \{x^{(i)}\}_{i=1}^{N}$ with $x^{(i)} \sim \mathcal{N}(\mu, \sigma^2)$, the point estimate of the mean $\mu$ is $m = \frac{\sum_i x^{(i)}}{N} \sim \mathcal{N}(\mu, \sigma^2/N)$, since the $x^{(i)}$ are i.i.d. Define a statistic $Z$ with a unit normal distribution: $Z = \frac{\sqrt{N}(m - \mu)}{\sigma} \sim \mathcal{N}(0, 1)$.

Two-Sided Confidence Interval For a unit normal distribution, about 95% of the probability mass of $Z$ lies in $(-1.96, 1.96)$: $P\left(-1.96 < \frac{\sqrt{N}(m - \mu)}{\sigma} < 1.96\right) = 0.95$, or $P\left(m - 1.96\frac{\sigma}{\sqrt{N}} < \mu < m + 1.96\frac{\sigma}{\sqrt{N}}\right) = 0.95$, meaning that with 95% confidence, $\mu$ lies within $1.96\sigma/\sqrt{N}$ units of the sample mean $m$.

Two-Sided Confidence Interval (2) Let us denote by $z_\alpha$ the value such that $P(Z > z_\alpha) = \alpha$, $0 < \alpha < 1$. Because $Z$ is symmetric around its mean, $P(Z < -z_{\alpha/2}) = P(Z > z_{\alpha/2}) = \alpha/2$. Hence, for any specified level of confidence $1 - \alpha$, we have $P\left(-z_{\alpha/2} < \frac{\sqrt{N}(m - \mu)}{\sigma} < z_{\alpha/2}\right) = 1 - \alpha$, or $P\left(m - z_{\alpha/2}\frac{\sigma}{\sqrt{N}} < \mu < m + z_{\alpha/2}\frac{\sigma}{\sqrt{N}}\right) = 1 - \alpha$.
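A small numeric sketch of the two-sided interval with known $\sigma$, assuming SciPy and an illustrative sample:

```python
# Two-sided 100(1 - alpha)% confidence interval for mu with known sigma
# (the sample values and sigma are illustrative).
import numpy as np
from scipy import stats

x = np.array([9.8, 10.2, 10.1, 9.9, 10.4, 10.0, 9.7, 10.3])
sigma = 0.25          # assumed known standard deviation
alpha = 0.05

m = x.mean()
N = len(x)
z = stats.norm.ppf(1 - alpha / 2)   # z_{alpha/2}, about 1.96 for alpha = 0.05
half_width = z * sigma / np.sqrt(N)
print(f"{100*(1-alpha):.0f}% CI for mu: ({m - half_width:.3f}, {m + half_width:.3f})")
```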

One-Sided Upper Confidence Interval Since $P(Z < 1.64) = 0.95$, the 95% one-sided upper confidence interval for $\mu$ follows from $P\left(\frac{\sqrt{N}(m - \mu)}{\sigma} < 1.64\right) = 0.95$, or $P\left(m - 1.64\frac{\sigma}{\sqrt{N}} < \mu\right) = 0.95$.

One-Sided Upper Confidence Interval (2) The one-sided upper confidence interval defines a lower bound for $\mu$. In general, a $100(1 - \alpha)$ percent one-sided upper confidence interval for $\mu$ can be computed from $P\left(m - z_\alpha\frac{\sigma}{\sqrt{N}} < \mu\right) = 1 - \alpha$. The one-sided lower confidence interval, which defines an upper bound, can be calculated similarly.

t Distribution When the variance $\sigma^2$ is not known, it can be replaced by the sample variance $S^2 = \frac{\sum_i (x^{(i)} - m)^2}{N - 1}$. The statistic $\sqrt{N}(m - \mu)/S$ follows a t distribution with $N - 1$ degrees of freedom: $\sqrt{N}(m - \mu)/S \sim t_{N-1}$. For any $\alpha \in (0, 1/2)$, $P\left(-t_{\alpha/2, N-1} < \frac{\sqrt{N}(m - \mu)}{S} < t_{\alpha/2, N-1}\right) = 1 - \alpha$, or $P\left(m - t_{\alpha/2, N-1}\frac{S}{\sqrt{N}} < \mu < m + t_{\alpha/2, N-1}\frac{S}{\sqrt{N}}\right) = 1 - \alpha$.

t Distribution (2) One-sided confidence intervals can also be defined. The t distribution has larger spread (longer tails) than the unit normal distribution, and generally the interval given by the t distribution is larger due to additional uncertainty about the unknown variance. As N increases, the t distribution will get closer to the normal distribution.
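When $\sigma$ is unknown, the interval uses $S$ and the t critical value instead; a small sketch with SciPy (the sample is illustrative):

```python
# Two-sided t interval for mu with unknown sigma (illustrative sample).
import numpy as np
from scipy import stats

x = np.array([9.8, 10.2, 10.1, 9.9, 10.4, 10.0, 9.7, 10.3])
alpha = 0.05

m = x.mean()
S = x.std(ddof=1)                               # sample standard deviation (N-1 denominator)
N = len(x)
t_crit = stats.t.ppf(1 - alpha / 2, df=N - 1)   # t_{alpha/2, N-1}
half_width = t_crit * S / np.sqrt(N)
print(f"{100*(1-alpha):.0f}% t interval for mu: ({m - half_width:.3f}, {m + half_width:.3f})")
```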

Hypothesis Testing Instead of explicitly estimating some parameters, hypothesis testing uses the sample to test some hypothesis concerning the parameters, e.g., whether the mean is less than 0.02. Hypothesis testing: we define a statistic that obeys a certain distribution if the hypothesis is correct. If the random sample is consistent with the hypothesis under consideration (i.e., if the statistic calculated from the sample has a high enough probability of being drawn from that distribution), the hypothesis is accepted; otherwise it is rejected. There are parametric and nonparametric tests; only parametric tests are considered here.

Null Hypothesis Given a sample from a normal distribution $\mathcal{N}(\mu, \sigma^2)$ where $\sigma$ is known but $\mu$ is unknown, we want to test a specific hypothesis about $\mu$, called the null hypothesis, e.g., $H_0: \mu = \mu_0$, against the alternative hypothesis $H_1: \mu \neq \mu_0$, for some specified constant $\mu_0$.

Two-Sided Test We accept $H_0$ if the sample mean $m$, a point estimate of $\mu$, is not too far from $\mu_0$. We accept $H_0$ with level of significance $\alpha$ if $\mu_0$ lies in the $100(1 - \alpha)$ percent confidence interval, i.e., $\frac{\sqrt{N}(m - \mu_0)}{\sigma} \in (-z_{\alpha/2}, z_{\alpha/2})$.

One-Sided Test Null and alternative hypotheses: $H_0: \mu \leq \mu_0$ vs. $H_1: \mu > \mu_0$. We accept $H_0$ with level of significance $\alpha$ if $\frac{\sqrt{N}(m - \mu_0)}{\sigma} \in (-\infty, z_\alpha)$.
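A sketch of both the two-sided and one-sided z tests above, using SciPy for the critical values; the sample, $\mu_0$, and $\sigma$ are illustrative:

```python
# z tests of H0: mu = mu0 (two-sided) and H0: mu <= mu0 (one-sided),
# with known sigma; the numbers are illustrative.
import numpy as np
from scipy import stats

x = np.array([9.8, 10.2, 10.1, 9.9, 10.4, 10.0, 9.7, 10.3])
sigma, mu0, alpha = 0.25, 10.0, 0.05

z_stat = np.sqrt(len(x)) * (x.mean() - mu0) / sigma

# Two-sided: accept H0 if the statistic lies in (-z_{alpha/2}, z_{alpha/2}).
two_sided_accept = abs(z_stat) < stats.norm.ppf(1 - alpha / 2)
# One-sided: accept H0: mu <= mu0 if the statistic lies in (-inf, z_alpha).
one_sided_accept = z_stat < stats.norm.ppf(1 - alpha)

print(f"z = {z_stat:.3f}, two-sided accept H0: {two_sided_accept}, one-sided accept H0: {one_sided_accept}")
```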

t Tests If the variance $\sigma^2$ is not known, the sample variance $S^2$ is used instead, and $\frac{\sqrt{N}(m - \mu_0)}{S} \sim t_{N-1}$. Two-sided t test: $H_0: \mu = \mu_0$ vs. $H_1: \mu \neq \mu_0$. We accept $H_0$ at significance level $\alpha$ if $\frac{\sqrt{N}(m - \mu_0)}{S} \in (-t_{\alpha/2, N-1}, t_{\alpha/2, N-1})$. One-sided t tests can be defined similarly.
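A sketch of the two-sided one-sample t test, cross-checked against SciPy's ttest_1samp, which computes the same statistic; the numbers are illustrative:

```python
# Two-sided one-sample t test of H0: mu = mu0 with unknown sigma (illustrative data).
import numpy as np
from scipy import stats

x = np.array([9.8, 10.2, 10.1, 9.9, 10.4, 10.0, 9.7, 10.3])
mu0, alpha = 10.0, 0.05

t_stat = np.sqrt(len(x)) * (x.mean() - mu0) / x.std(ddof=1)
t_crit = stats.t.ppf(1 - alpha / 2, df=len(x) - 1)
print("accept H0" if abs(t_stat) < t_crit else "reject H0")

# Cross-check with the library routine (p-value based decision).
res = stats.ttest_1samp(x, popmean=mu0)
print(f"t = {res.statistic:.3f}, p-value = {res.pvalue:.3f}")
```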

Binomial Test Single training/validation set pair: a classifier is trained on a training set $T$ and tested on a validation set $V$. Let $p$ be the (unknown) probability that the classifier makes a misclassification error. For the $i$-th instance in $V$, we define a Bernoulli variable $x^{(i)}$ denoting the correctness of the classifier's decision: $x^{(i)} = 1$ with probability $p$, and $x^{(i)} = 0$ with probability $1 - p$. Point estimate of $p$: $\hat{p} = \frac{\sum_i x^{(i)}}{|V|} = \frac{\sum_i x^{(i)}}{N}$.

Binomial Test (2) Hypothesis test: $H_0: p \leq p_0$ vs. $H_1: p > p_0$. Given that the classifier makes $e$ errors on $V$ with $|V| = N$, can we say that the classifier has error probability $p_0$ or less? Let $X = \sum_{i=1}^{N} x^{(i)}$ denote the number of errors on $V$. Because the $x^{(i)}$ are independent Bernoulli variables, $X$ is binomial: $P(X = j) = \binom{N}{j} p^j (1 - p)^{N-j}$. Under the null hypothesis $p \leq p_0$, so the probability of $e$ errors or fewer is $P(X \leq e) = \sum_{j=0}^{e} \binom{N}{j} p_0^j (1 - p_0)^{N-j}$. Binomial test: accept $H_0$ if $P(X \leq e) < 1 - \alpha$; reject otherwise.
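A numeric sketch of the binomial test, assuming SciPy's binomial CDF and illustrative values of $N$, $e$, and $p_0$:

```python
# Binomial test sketch: with e errors in N validation instances, compute
# P(X <= e) under p0 and accept H0: p <= p0 if it is below 1 - alpha.
from scipy import stats

N, e = 100, 8          # illustrative: 8 errors on 100 validation instances
p0, alpha = 0.10, 0.05

prob_at_most_e = stats.binom.cdf(e, N, p0)   # sum_{j=0}^{e} C(N,j) p0^j (1-p0)^{N-j}
print("P(X <= e) =", round(prob_at_most_e, 3))
print("accept H0: p <= p0" if prob_at_most_e < 1 - alpha else "reject H0")
```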

Paired t Test Multiple training/validation set pairs: if we run the algorithm $K$ times on $K$ training/validation set pairs, we get $K$ error probabilities $p_k$, $k = 1, \dots, K$, on the $K$ validation sets. We can use the paired t test to determine whether to accept the null hypothesis $H_0$ that the classifier has error probability $p_0$ or less at significance level $\alpha$. Bernoulli variables: $x^{(i)}_k = 1$ if the classifier trained on $T_k$ makes an error on instance $i$ of $V_k$, and $x^{(i)}_k = 0$ otherwise. Test statistic: $\frac{\sqrt{K}(m - p_0)}{S} \sim t_{K-1}$, where $p_k = \frac{\sum_{i=1}^{N} x^{(i)}_k}{N}$, $m = \frac{\sum_{k=1}^{K} p_k}{K}$, and $S^2 = \frac{\sum_{k=1}^{K} (p_k - m)^2}{K - 1}$.
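A sketch of the paired t test against a target error rate $p_0$, with illustrative per-fold error rates:

```python
# Paired t test of H0: p <= p0 from K validation-fold error rates
# (the error rates, p0, and alpha below are illustrative).
import numpy as np
from scipy import stats

p_k = np.array([0.12, 0.09, 0.11, 0.10, 0.13, 0.08, 0.12, 0.10, 0.11, 0.09])
p0, alpha = 0.10, 0.05
K = len(p_k)

m = p_k.mean()
S = p_k.std(ddof=1)
t_stat = np.sqrt(K) * (m - p0) / S           # ~ t_{K-1} under H0
t_crit = stats.t.ppf(1 - alpha, df=K - 1)    # one-sided critical value t_{alpha, K-1}
print(f"t = {t_stat:.3f}")
print("accept H0: p <= p0" if t_stat < t_crit else "reject H0")
```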

K-Fold Cross-Validated Paired t Test Given two classification algorithms and a data set, we want to test whether the two algorithms construct classifiers that have the same expected error rate on a new instance. K-fold cross validation is used to obtain $K$ training/validation set pairs $\{(T_k, V_k)\}_{k=1}^{K}$. For each pair $(T_k, V_k)$, the two classification algorithms are trained on $T_k$ and tested on $V_k$, yielding error probabilities $p^1_k$ and $p^2_k$. We define $p_k = p^1_k - p^2_k$, $k = 1, \dots, K$. The distribution of $p_k$ is used for hypothesis testing: if $p^1_k$ and $p^2_k$ are both normally distributed, then $p_k$ should also be normal (with mean $\mu$).

K-Fold Cross-Validated Paired t Test (2) Null and alternative hypotheses: $H_0: \mu = 0$ vs. $H_1: \mu \neq 0$. Test statistic: $\frac{\sqrt{K}(m - 0)}{S} = \frac{\sqrt{K}\, m}{S} \sim t_{K-1}$, where $m = \frac{\sum_{k=1}^{K} p_k}{K}$ and $S^2 = \frac{\sum_{k=1}^{K} (p_k - m)^2}{K - 1}$. K-fold CV paired t test: accept $H_0$ at significance level $\alpha$ if $\sqrt{K}\, m / S \in (-t_{\alpha/2, K-1}, t_{\alpha/2, K-1})$; reject otherwise.
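A sketch of the K-fold cross-validated paired t test, with illustrative per-fold error rates for the two algorithms; SciPy's ttest_rel computes the same statistic:

```python
# K-fold cross-validated paired t test comparing two algorithms
# (the per-fold error rates are illustrative).
import numpy as np
from scipy import stats

p1 = np.array([0.12, 0.10, 0.14, 0.11, 0.13, 0.09, 0.12, 0.10, 0.11, 0.13])  # algorithm 1
p2 = np.array([0.10, 0.11, 0.12, 0.09, 0.12, 0.10, 0.10, 0.09, 0.10, 0.12])  # algorithm 2
p = p1 - p2
K = len(p)
alpha = 0.05

m = p.mean()
S = p.std(ddof=1)
t_stat = np.sqrt(K) * m / S
t_crit = stats.t.ppf(1 - alpha / 2, df=K - 1)
print(f"t = {t_stat:.3f}")
print("accept H0: equal error rates" if abs(t_stat) < t_crit else "reject H0")

# Equivalent library call on the paired error vectors:
res = stats.ttest_rel(p1, p2)
print(f"ttest_rel: t = {res.statistic:.3f}, p-value = {res.pvalue:.3f}")
```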

Comparing L > 2 Classification Algorithms When L algorithms are compared based on K training/validation set pairs, we get L groups of K error probabilities each. The problem is to compare the L samples for statistically significant difference. One method that is often used is based on analysis of variance (ANOVA).
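A sketch of a one-way ANOVA over $L = 3$ algorithms' per-fold error rates, assuming SciPy and illustrative numbers:

```python
# One-way ANOVA sketch for comparing L > 2 algorithms from their K error
# rates each (SciPy and the error rates are assumptions).
import numpy as np
from scipy import stats

errs_A = np.array([0.12, 0.10, 0.14, 0.11, 0.13])
errs_B = np.array([0.10, 0.11, 0.12, 0.09, 0.12])
errs_C = np.array([0.15, 0.14, 0.16, 0.13, 0.15])

f_stat, p_value = stats.f_oneway(errs_A, errs_B, errs_C)
print(f"F = {f_stat:.3f}, p-value = {p_value:.3f}")
# A small p-value suggests at least one algorithm's mean error rate differs.
```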