Machine Learning: Evaluation


Machine Learning: Evaluation
Information Systems and Machine Learning Lab (ISMLL)
University of Hildesheim
Wintersemester 2007 / 2008

Comparison of Algorithms

Comparison of Algorithms

Is algorithm A better than algorithm B?
- depends on the task/dataset
- a difference might be due to the limited size of the dataset
- the same quality measure and dataset(s) should be used to evaluate A and B

Comparison Schemes: Paired Test

Both algorithms are trained on the same n training sets and evaluated on the same n test sets, e.g. with 10-fold CV (a sketch of this scheme in code follows below):
1. Train both A and B on folds f_2, ..., f_n; evaluate each on f_1. We get quality measures a_1 and b_1.
2. Train both A and B on folds f_1, f_3, ..., f_n; evaluate each on f_2. We get quality measures a_2 and b_2.
3. ...
We get results a_1, ..., a_n and b_1, ..., b_n. The results a_i and b_i are paired. Pairing reduces the variance in the quality estimates.
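A minimal sketch of this scheme in Python (the dataset, models, and accuracy metric are illustrative assumptions, not part of the lecture). Passing the same fold splitter to both learners yields paired results:

```python
# Sketch: paired quality measures via 10-fold CV (assumed setup).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
folds = KFold(n_splits=10, shuffle=True, random_state=0)  # same folds for A and B!

a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=folds)   # a_1, ..., a_10
b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=folds)  # b_1, ..., b_10
# a[i] and b[i] are computed on identical train/test splits, so they are paired.
```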

Comparison Schemes: Unpaired Test

Given are n results for algorithm A and m results for algorithm B. E.g., one researcher has implemented logistic regression and evaluated it on the iris dataset with 10-fold CV; another researcher has implemented a decision tree and evaluated it on iris with 5-fold CV. They can compare their results with an unpaired test.

Hypothesis Testing

Hypothesis testing is a statistical method for comparing two series of test results.
- Hypothesis H_0: there is no difference. H_0 is also called the null hypothesis.
- This hypothesis is tested at a significance level α, e.g. 5%.
- A hypothesis test tries to falsify/reject the null hypothesis. If we can reject the null hypothesis, the results are significantly different.

Hypothesis Testing

We deal with two testing schemes:
- Wald test: a general test for normally distributed data
- t-test: for testing means of normally distributed data; good for small sample sizes

Paired Wald-Test

Data: n samples
- $X_1, \ldots, X_n$ for algorithm A
- $Y_1, \ldots, Y_n$ for algorithm B
- $X_i$ and $Y_i$ are paired!

Null hypothesis: $\delta = 0$

Paired Wald-Test

$Z_i := X_i - Y_i$ (well-defined because $X_i$ and $Y_i$ are paired!)

$\delta = E(Z) = E(X) - E(Y)$

$\delta$ is estimated by $\hat{\delta} = \bar{X} - \bar{Y} = \bar{Z}$.

$\hat{\delta}$ is a random variable; its estimated standard error $\hat{se}(\hat{\delta})$ is

$\hat{se}(\hat{\delta}) = \sqrt{\frac{s_Z^2}{n}} \quad \text{with} \quad s_Z^2 := \frac{1}{n-1} \sum_{i=1}^{n} (Z_i - \bar{Z})^2$

Paired Wald-Test

The normalized random variable W describing the error is:

$W := \frac{\hat{\delta}}{\hat{se}(\hat{\delta})} = \frac{\bar{Z}}{\sqrt{\frac{1}{n(n-1)} \sum_{i=1}^{n} (Z_i - \bar{Z})^2}}$

As $W$ is (approximately) $N(0, 1)$ under $H_0$, the true difference lies with probability at least $1 - \alpha$ in:

$C_n = \bar{Z} \pm z_{\alpha/2} \sqrt{\frac{1}{n(n-1)} \sum_{i=1}^{n} (Z_i - \bar{Z})^2}$

We can test our hypothesis $\delta = 0$ (a sketch in code follows below):
- Method 1: if $0 \notin C_n$ then reject $H_0$, i.e. there is evidence of a difference between the two algorithms.
- Method 2: if $|W| > z_{\alpha/2}$ then reject $H_0$.
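A minimal sketch of this test in Python (my own illustration, not from the slides; the function name `paired_wald_test` is hypothetical):

```python
import numpy as np
from scipy.stats import norm

def paired_wald_test(a, b, alpha=0.05):
    """Paired Wald test of H0: delta = 0 for paired results a_i, b_i."""
    z = np.asarray(a, float) - np.asarray(b, float)   # Z_i = X_i - Y_i
    n = len(z)
    delta_hat = z.mean()                              # delta-hat = Z-bar
    se_hat = np.sqrt(z.var(ddof=1) / n)               # se-hat(delta-hat)
    w = delta_hat / se_hat                            # approx. N(0,1) under H0
    z_crit = norm.ppf(1 - alpha / 2)                  # z_{alpha/2}
    ci = (delta_hat - z_crit * se_hat, delta_hat + z_crit * se_hat)  # C_n
    return w, ci, abs(w) > z_crit                     # Method 2 (equivalent to Method 1)
```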

Unpaired Wald-Test

Data: n samples of A, m samples of B
- $X_1, \ldots, X_n$ for algorithm A
- $Y_1, \ldots, Y_m$ for algorithm B
- $X_i$ and $Y_i$ are independent!

Null hypothesis: $\delta = 0$

Unpaired Wald-Test

$\delta = E(X) - E(Y)$

$\delta$ is estimated by $\hat{\delta} = \bar{X} - \bar{Y}$.

$\hat{\delta}$ is a random variable; its estimated standard error $\hat{se}(\hat{\delta})$ is

$\hat{se}(\hat{\delta}) = \sqrt{\frac{s_X^2}{n} + \frac{s_Y^2}{m}} \quad \text{with} \quad s_U^2 := \frac{1}{k-1} \sum_{i=1}^{k} (U_i - \bar{U})^2$

for $U \in \{X, Y\}$ with $k$ samples.

Unpaired Wald-Test

The normalized random variable W describing the error is:

$W := \frac{\hat{\delta}}{\hat{se}(\hat{\delta})} = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{s_X^2}{n} + \frac{s_Y^2}{m}}}$

As $W$ is (approximately) $N(0, 1)$ under $H_0$, the true difference lies with probability at least $1 - \alpha$ in:

$C_n = \bar{X} - \bar{Y} \pm z_{\alpha/2} \sqrt{\frac{s_X^2}{n} + \frac{s_Y^2}{m}}$

We can test our hypothesis $\delta = 0$ (see the sketch below):
- Method 1: if $0 \notin C_n$ then reject $H_0$, i.e. there is evidence of a difference between the two algorithms.
- Method 2: if $|W| > z_{\alpha/2}$ then reject $H_0$.
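The corresponding sketch for the unpaired case (again my illustration; the function name is hypothetical):

```python
import numpy as np
from scipy.stats import norm

def unpaired_wald_test(x, y, alpha=0.05):
    """Unpaired Wald test of H0: delta = 0 for independent samples x and y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    delta_hat = x.mean() - y.mean()                                  # X-bar - Y-bar
    se_hat = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    w = delta_hat / se_hat                                           # approx. N(0,1) under H0
    z_crit = norm.ppf(1 - alpha / 2)
    ci = (delta_hat - z_crit * se_hat, delta_hat + z_crit * se_hat)  # C_n
    return w, ci, abs(w) > z_crit
```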

t-Tests

The Wald test is based on $W \sim N(0, 1)$. For small sample sizes, the approximation with a normal distribution is inaccurate; in fact, for small sample sizes W is t-distributed:

$W \sim t_{n-1}$

where the density of the t-distribution with n degrees of freedom is

$f(x) = \alpha_n \left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}} \quad \text{with} \quad \alpha_n = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)}$
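To see why this matters, compare critical values (an illustration, not from the slides): for few degrees of freedom the t quantile is clearly larger than the normal one, so the Wald test rejects too eagerly on small samples.

```python
from scipy.stats import norm, t

alpha = 0.05
print(f"z_(alpha/2)      = {norm.ppf(1 - alpha / 2):.3f}")        # ~1.960
for df in (4, 9, 29, 99):
    # The t critical value approaches the normal one as df grows:
    print(f"t_({df},alpha/2) = {t.ppf(1 - alpha / 2, df):.3f}")   # ~2.776, 2.262, 2.045, 1.984
```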

t-Tests

t-tests are performed by using $t_{k,\alpha}$ instead of $z_\alpha$, where k is the number of degrees of freedom.
- Paired t-test: degrees of freedom $k = n - 1$. Method: if $|W| > t_{k,\alpha/2}$ then reject $H_0$.
- Unpaired t-test: degrees of freedom $k = \min\{n, m\} - 1$. Method: if $|W| > t_{k,\alpha/2}$ then reject $H_0$.

Example

Paired t-test with level $\alpha = 0.05$ for the data:

            fold 1   fold 2   fold 3   fold 4   fold 5
  Method A    88       89       92       90       90
  Method B    92       90       91       89       91

... see blackboard ...
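The slide defers the computation to the blackboard; the following sketch (my reconstruction, not the blackboard content) carries it out:

```python
import numpy as np
from scipy.stats import t

a = np.array([88, 89, 92, 90, 90], float)    # Method A
b = np.array([92, 90, 91, 89, 91], float)    # Method B
z = a - b                                    # paired differences Z_i
n = len(z)                                   # n = 5, so k = n - 1 = 4
w = z.mean() / np.sqrt(z.var(ddof=1) / n)    # W ~ -0.873
t_crit = t.ppf(1 - 0.05 / 2, n - 1)          # t_{4, 0.025} ~ 2.776
print(abs(w) > t_crit)                       # False: H0 cannot be rejected
```

So for this data the observed difference between Method A and Method B is not significant at the 5% level.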