How do we compare the relative performance among competing models?


Comparing Data Mining Methods

A frequent problem: we want to know which of two learning techniques is better. How can we reliably say that Model A is better or worse than Model B? We can compare them on different test sets, or compare their 10-fold cross-validation estimates. Both approaches require significance testing.

Significance Tests

Significance tests tell us how (statistically) confident we can be that there is truly a difference. For example:
Null hypothesis: there is no real difference.
Alternative hypothesis: there is a difference.
A significance test measures how much evidence there is in favor of rejecting the null hypothesis.

Methods for Comparing Classifiers

Two models:
Model M1: accuracy = 85%, tested on 30 instances.
Model M2: accuracy = 75%, tested on 5,000 instances.
Can we say M1 is better than M2? How much confidence can we have in the accuracy of each model? Can the difference in the performance measure be explained as a result of random fluctuations in the test set?

Confidence Intervals

We can say that the error lies within a certain specified interval with a certain specified confidence. Example: S = 750 successes in n = 1000 test examples, so the estimated error rate is 25%. How close is this to the true error rate? With 95% confidence, the true error rate lies in [22.32%, 27.68%].
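A minimal sketch of that computation using the normal approximation; SciPy's norm.ppf and the variable names here are our choices, not from the slides:

import math
from scipy.stats import norm

n, successes = 1000, 750
err = 1 - successes / n                       # estimated error rate: 0.25
z = norm.ppf(1 - 0.05 / 2)                    # ~1.96 for 95% confidence
half_width = z * math.sqrt(err * (1 - err) / n)
print(f"[{100 * (err - half_width):.2f}%, {100 * (err + half_width):.2f}%]")
# prints [22.32%, 27.68%], matching the interval above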

Confidence Interval for Accuracy

Each prediction can be regarded as a Bernoulli trial with two possible outcomes, correct or incorrect, and a collection of Bernoulli trials has a binomial distribution. Given the number of correct test predictions x and the number of test instances N, the accuracy is acc = x/N. Can we predict the true accuracy of the model from acc?

Confidence Interval for Accuracy

For large test sets (N > 30), the accuracy acc has a normal distribution with mean p and variance p(1 − p)/N:

P( Z_{α/2} < (acc − p) / sqrt(p(1 − p)/N) < Z_{1−α/2} ) = 1 − α

Solving this bound for p gives a confidence interval for the true accuracy.

Confidence Interval for Accuracy

Consider a test set containing 1,000 examples (N = 1000), of which 750 have been correctly classified (x = 750, acc = 75%). If we want an 80% confidence level, then the true performance p is between 73.2% and 76.7%. If instead we only have 100 test examples with 75 correctly classified, the true performance p is between 69.1% and 80.1%: the smaller the test set, the wider the interval.

Confidence Interval for Accuracy

Consider a model that produces an accuracy of 80% when evaluated on 100 test instances: N = 100, acc = 0.8. Let 1 − α = 0.95 (95% confidence). From the probability table below, Z_{α/2} = 1.96.

1 − α:   0.99   0.98   0.95   0.90
Z:       2.58   2.33   1.96   1.65

Solving for p at different test-set sizes N (acc = 0.8 throughout):

N:          50     100    500    1000   5000
p (lower):  0.670  0.711  0.763  0.774  0.789
p (upper):  0.888  0.866  0.833  0.824  0.811
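The table entries can be reproduced by solving the probability statement above for p, which is quadratic in p (the Wilson score interval). A sketch under that assumption; the function name is ours:

import math
from scipy.stats import norm

def accuracy_interval(acc, n, confidence=0.95):
    """Solve P(Z_{a/2} < (acc - p)/sqrt(p(1-p)/n) < Z_{1-a/2}) = confidence for p."""
    z = norm.ppf(1 - (1 - confidence) / 2)
    center = acc + z**2 / (2 * n)
    margin = z * math.sqrt(acc * (1 - acc) / n + z**2 / (4 * n**2))
    denom = 1 + z**2 / n
    return (center - margin) / denom, (center + margin) / denom

for n in (50, 100, 500, 1000, 5000):
    lo, hi = accuracy_interval(0.8, n)
    print(f"N={n}: p in [{lo:.3f}, {hi:.3f}]")
# N=50: [0.670, 0.888] ... N=5000: [0.789, 0.811], matching the table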

Establishing Confidence Intervals

If the test set S contains n examples drawn independently and n ≥ 30, then with approximately 95% probability (or confidence), the true error error_D(h) lies in the interval

error_S(h) ± 1.96 · sqrt( error_S(h) · (1 − error_S(h)) / n )

Comparing Performance of Two Models

Two models, say M1 and M2: which is better?
M1 is tested on D1 (size n_1), with observed error rate e_1.
M2 is tested on D2 (size n_2), with observed error rate e_2.
Assume D1 and D2 are independent. If n_1 and n_2 are sufficiently large, then

e_1 ~ N(μ_1, σ_1²)    e_2 ~ N(μ_2, σ_2²)

where the variances can be approximated by σ_i² = e_i(1 − e_i)/n_i.

Comparing Performance of Two Models

To test whether the difference between the performance of M1 and M2 is statistically significant, we consider d = e_1 − e_2. Then d ~ N(d_t, σ_t²), where d_t is the true difference. Since D1 and D2 are independent:

σ_t² = σ_1² + σ_2² ≈ e_1(1 − e_1)/n_1 + e_2(1 − e_2)/n_2

At the (1 − α) confidence level: d_t = d ± Z_{α/2} · σ_t.

Example of Comparing Two Models

Given M1 with n_1 = 30 and e_1 = 0.15, and M2 with n_2 = 5000 and e_2 = 0.25, we have d = 0.1 (two-sided test). Thus:

σ_t² = 0.15(1 − 0.15)/30 + 0.25(1 − 0.25)/5000 = 0.0043

At the 95% confidence level, Z_{α/2} = 1.96, so:

d_t = 0.100 ± 1.96 · sqrt(0.0043) = 0.100 ± 0.128

The interval contains 0, therefore the difference may not be statistically significant.
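The same arithmetic as a short sketch; the function name and structure are ours:

import math
from scipy.stats import norm

def difference_interval(e1, n1, e2, n2, confidence=0.95):
    """Confidence interval for the true difference d_t between two error rates."""
    d = abs(e1 - e2)
    var_t = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2   # sigma_t^2
    z = norm.ppf(1 - (1 - confidence) / 2)            # 1.96 at 95% confidence
    margin = z * math.sqrt(var_t)
    return d - margin, d + margin

lo, hi = difference_interval(0.15, 30, 0.25, 5000)
print(f"d_t in [{lo:.3f}, {hi:.3f}]")  # [-0.028, 0.228]: contains 0, so not significant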

Student's t-test

Student's t-test tells us whether the means of two samples are significantly different. We take individual samples from the sets of all possible cross-validation estimates, and use a paired t-test because the individual samples are paired: the same cross-validation folds are applied to both learning schemes.

Two-tailed t-test

Comparing Two Classifiers

Suppose we want to compare the performance of two classifiers using the k-fold cross-validation approach. Assume we performed 10-fold CV for each of the two classifiers. We want to know whether there is a statistically significant difference between the two mean error estimates.

Comparing Algorithms A and B

Partition the data D into k stratified disjoint subsets T_1, T_2, …, T_k of equal size.
For i = 1 to k do:
    S_i ← D − T_i   (use T_i as the testing set and S_i as the training set)
    h_A ← L_A(S_i)
    h_B ← L_B(S_i)
    δ_i ← error_{T_i}(h_A) − error_{T_i}(h_B)
Return δ̄ = (1/k) · Σ_{i=1}^{k} δ_i
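A sketch of this procedure with scikit-learn; StratifiedKFold matches the stratified partition above, while the learner arguments are placeholders for L_A and L_B (any scikit-learn classifiers):

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

def paired_cv_differences(learner_a, learner_b, X, y, k=10, seed=0):
    """Return delta_i = error_{T_i}(h_A) - error_{T_i}(h_B) for each fold T_i."""
    deltas = []
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X, y):
        h_a = clone(learner_a).fit(X[train_idx], y[train_idx])  # h_A = L_A(S_i)
        h_b = clone(learner_b).fit(X[train_idx], y[train_idx])  # h_B = L_B(S_i)
        err_a = 1 - h_a.score(X[test_idx], y[test_idx])         # error on T_i
        err_b = 1 - h_b.score(X[test_idx], y[test_idx])
        deltas.append(err_a - err_b)
    return np.array(deltas)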

Comparing Classifiers A and B

The difference of the means also follows a Student's t distribution, with k − 1 degrees of freedom. The N% confidence interval for δ̄ is:

δ̄ ± t_{N,k−1} · s_δ̄   where   s_δ̄ = sqrt( (1/(k(k−1))) · Σ_{i=1}^{k} (δ_i − δ̄)² )

and the test statistic is t = δ̄ / s_δ̄, compared against t_{N,k−1}.

Performing the t-test

1. Fix a significance level α. If a difference is significant at the α% level, there is a (100 − α)% chance that there really is a difference.
2. Divide the significance level by two, because the test is two-tailed, i.e., the true difference can be positive or negative.
3. If t ≤ −t_{N,k−1} or t ≥ t_{N,k−1}, then the difference is significant, i.e., the null hypothesis can be rejected.
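Continuing the sketch above, the test itself; deltas is the array of per-fold differences from the previous function, and scipy.stats.t supplies the critical value:

import numpy as np
from scipy.stats import t as t_dist

def paired_t_test(deltas, alpha=0.05):
    k = len(deltas)
    d_bar = deltas.mean()
    s_dbar = np.sqrt(((deltas - d_bar) ** 2).sum() / (k * (k - 1)))
    t_stat = d_bar / s_dbar
    t_crit = t_dist.ppf(1 - alpha / 2, df=k - 1)  # two-tailed: alpha split in half
    return t_stat, abs(t_stat) >= t_crit          # True -> reject the null hypothesis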

One-tailed t-test

Unpaired Observations

If the CV estimates are from different randomizations, they are no longer paired, and we have to use an unpaired t-test with min(k, j) − 1 degrees of freedom. The paired statistic t = m_d / sqrt(σ_d²/k) becomes:

t = (m_x − m_y) / sqrt( σ_x²/k + σ_y²/j )
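A corresponding sketch for the unpaired case (the function and variable names are ours), using the same SciPy critical value:

import numpy as np
from scipy.stats import t as t_dist

def unpaired_t_test(x, y, alpha=0.05):
    """x, y: numpy arrays of error estimates from k and j independent CV runs."""
    k, j = len(x), len(y)
    t_stat = (x.mean() - y.mean()) / np.sqrt(x.var(ddof=1) / k + y.var(ddof=1) / j)
    t_crit = t_dist.ppf(1 - alpha / 2, df=min(k, j) - 1)
    return t_stat, abs(t_stat) >= t_crit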

Evaluation Measures Summary

What you should know:
Confidence intervals.
Evaluation schemes: hold-out, 10-fold CV, bootstrap, etc.
Significance tests.
Different evaluation measures for classification: error/accuracy, ROC, F-measure, lift curves, cost-sensitive classification.