CptS 570 Machine Learning, School of EECS, Washington State University

IEEE Expert, October 1996

Given a sample S drawn from the space of all possible examples D, learner L learns hypothesis h based on S. Sample error: error_S(h). True error: error_D(h). Example: hypothesis h misclassifies 12 of the 40 examples in S, so error_S(h) = 0.3. What is error_D(h)?

Learner A learns hypothesis h_A on sample S, and learner B learns hypothesis h_B on sample S. We observe error_S(h_A) < error_S(h_B). Is error_D(h_A) < error_D(h_B)? Is learner A better than learner B?

How can we estimate the true error of a classifier? How can we determine whether one learner is better than another? Using the sample error is too optimistic. Using the error on a separate test set is better, but might still be misleading. Repeating this for multiple iterations, each with a different training/testing split, yields a better estimate of the true error.

David Wolpert, 1995: for any learning algorithm there are datasets on which it does well, and datasets on which it does poorly. Performance estimates are based on specific datasets, not an estimate of the learner's performance over all datasets. There is no single best learning algorithm.

Multiple iterations of learning on a training set and testing on a separate validation set are only for evaluation and parameter tuning; final learning should be done on all available data. If the validation set is used to choose or tune a learning method, then it cannot also be used to compare performance against another learning algorithm. We need yet another test set that is unseen during tuning and learning.

Other criteria for comparing learners: error costs (false positives vs. false negatives); training time and space complexity; testing time and space complexity; interpretability; ease of implementation.

Given dataset X, for each of K trials: randomly divide X into a training set (2/3) and a testing set (1/3); learn a classifier on the training set; test the classifier on the testing set (compute the error). Compute the average error over the K trials. Problem: the training and testing sets overlap between trials, which biases the results.
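A minimal sketch of this repeated random-split procedure, assuming scikit-learn is available; the iris data and the decision tree are placeholders, since the slide does not prescribe a dataset or classifier:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)    # placeholder dataset
K = 10                               # number of random trials

errors = []
for k in range(K):
    # Randomly split into 2/3 training and 1/3 testing on each trial.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=k)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    errors.append(1.0 - clf.score(X_te, y_te))      # test-set error for this trial

print("mean error over %d trials: %.3f" % (K, np.mean(errors)))
# Caveat from the slide: the K test sets overlap, which biases the estimate.
```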

K-fold cross-validation: given dataset X, partition X into K disjoint sets X_1, ..., X_K. For i = 1 to K: learn a classifier on the training set X - X_i and test it on the testing set X_i (compute the error). Compute the average error over the K trials. The testing sets no longer overlap, but the training sets still do.
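The same loop with disjoint test folds, sketched with scikit-learn's KFold (dataset and classifier are again placeholders):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

errors = []
for train_idx, test_idx in kf.split(X):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])                      # train on X - X_i
    errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))  # test on X_i

print("10-fold CV error: %.3f +/- %.3f" % (np.mean(errors), np.std(errors)))
```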

Stratification: the distribution of classes in the training and testing sets should be the same as in the original dataset; this is called stratified cross-validation. Leave-one-out cross-validation: K = N = |X|; used when labeled data is scarce.
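Stratified folds and leave-one-out are small variations in the same framework; a sketch assuming scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (StratifiedKFold, LeaveOneOut,
                                     cross_val_score)
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Stratified K-fold: each fold preserves the original class proportions.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
acc_strat = cross_val_score(clf, X, y, cv=skf)

# Leave-one-out: K = N = |X|, useful when labeled data is scarce.
acc_loo = cross_val_score(clf, X, y, cv=LeaveOneOut())

print("stratified 10-fold error:", 1.0 - acc_strat.mean())
print("leave-one-out error:     ", 1.0 - acc_loo.mean())
```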

Tom Dietterich, 1998 (5x2 cross-validation): for each of 5 trials (shuffling X each time), divide X into two halves X_1 and X_2; compute the error using X_1 for training and X_2 for testing, then compute the error using X_2 for training and X_1 for testing. Compute the average error over all 10 results. Five trials is the recommended number to minimize overlap among the training and testing sets.
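A sketch of Dietterich's 5x2 procedure: five shuffles, two half-splits each, ten error estimates in total (dataset and classifier are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

errors = []
for trial in range(5):                      # 5 trials, reshuffling each time
    X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=trial)
    for (Xtr, ytr, Xte, yte) in [(X1, y1, X2, y2), (X2, y2, X1, y1)]:
        clf = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
        errors.append(1.0 - clf.score(Xte, yte))   # one of the 10 error estimates

print("5x2 CV mean error: %.3f" % np.mean(errors))
```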

If there is not enough data for k-fold cross-validation (the bootstrap): generate multiple samples of size N from X by sampling with replacement. Each sample contains approximately 63% of the distinct examples in X, since the probability that a given example is never drawn is (1 - 1/N)^N, which approaches e^(-1) = 0.37. Compute the average error over all samples.
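A bootstrap sketch: draw N examples with replacement for training and, as one common convention not spelled out on the slide, evaluate on the examples left out of each bootstrap sample:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
N = len(X)
rng = np.random.default_rng(0)

errors = []
for b in range(200):                               # number of bootstrap samples
    idx = rng.integers(0, N, size=N)               # sample N indices with replacement
    oob = np.setdiff1d(np.arange(N), idx)          # ~37% of examples are left out
    if len(oob) == 0:
        continue
    clf = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    errors.append(1.0 - clf.score(X[oob], y[oob]))  # error on the left-out examples

print("bootstrap (out-of-bag) error: %.3f" % np.mean(errors))
```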

Confusion matrix (rows: true class, columns: predicted class):

True \ Predicted    Positive               Negative               Total
Positive            tp (true positive)     fn (false negative)    p
Negative            fp (false positive)    tn (true negative)     n
Total               p'                     n'                     N

Name          Formula
error         (fp + fn) / N
accuracy      (tp + tn) / N
tp-rate       tp / p
fp-rate       fp / n
precision     tp / p'
recall        tp / p   (= tp-rate)
sensitivity   tp / p   (= tp-rate)
specificity   tn / n   (= 1 - fp-rate)

F-measure: F = 2 * precision * recall / (precision + recall)
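These quantities follow directly from the confusion-matrix counts; a small sketch with made-up counts:

```python
# Illustrative counts; any confusion matrix works.
tp, fn = 30, 10    # true positives, false negatives  (p = tp + fn actual positives)
fp, tn = 5, 55     # false positives, true negatives  (n = fp + tn actual negatives)

p, n = tp + fn, fp + tn
N = p + n

error       = (fp + fn) / N
accuracy    = (tp + tn) / N
tp_rate     = tp / p                 # recall, sensitivity
fp_rate     = fp / n
precision   = tp / (tp + fp)         # tp over predicted positives p'
specificity = tn / n                 # = 1 - fp_rate
f_measure   = 2 * precision * tp_rate / (precision + tp_rate)

print(f"error={error:.3f} acc={accuracy:.3f} prec={precision:.3f} "
      f"recall={tp_rate:.3f} F={f_measure:.3f}")
```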

=== Run information ===

Scheme:     weka.classifiers.rules.OneR -B 6
Relation:   labor-neg-data
Instances:  57
Attributes: 17
            duration
            wage-increase-first-year
            wage-increase-second-year
            wage-increase-third-year
            cost-of-living-adjustment
            working-hours
            pension
            standby-pay
            shift-differential
            education-allowance
            statutory-holidays
            vacation
            longterm-disability-assistance
            contribution-to-dental-plan
            bereavement-assistance
            contribution-to-health-plan
            class

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

wage-increase-first-year:
    < 2.9   -> bad
    >= 2.9  -> good
    ?       -> good
(48/57 instances correct)

Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          43                75.4386 %
Incorrectly Classified Instances        14                24.5614 %
Kappa statistic                          0.4063
Mean absolute error                      0.2456
Root mean squared error                  0.4956
Relative absolute error                 53.6925 %
Root relative squared error            103.7961 %
Coverage of cases (0.95 level)          75.4386 %
Mean rel. region size (0.95 level)      50      %
Total Number of Instances               57

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.45      0.081     0.75        0.45     0.563       0.684      bad
               0.919     0.55      0.756       0.919    0.829       0.684      good
Weighted Avg.  0.754     0.385     0.754       0.754    0.736       0.684

=== Confusion Matrix ===

  a  b   <-- classified as
  9 11 |  a = bad
  3 34 |  b = good

Most comparisons of machine learning algorithms use classification error. Problems with this approach: there may be different costs associated with false positive and false negative errors, and the training data may not reflect the true class distribution.

Receiver Operating Characteristic (ROC): originated in signal detection theory, is common in medical diagnosis, and is becoming common in ML evaluations. ROC curves assess predictive behavior independent of error costs or class distributions. Area Under the ROC Curve (AUC): a single measure of learning algorithm performance, independent of error costs and class distributions.

[ROC plot: true positive rate vs. false positive rate for learners L1, L2, and L3, plus the random-guessing diagonal.]

Learner L1 dominates L2 if L1's ROC curve is always above L2's curve. If L1 dominates L2, then L1 is better than L2 for all possible error costs and class distributions. If neither dominates (as with L2 and L3), then different classifiers are better under different conditions.

Assume the classifier outputs P(C|x) instead of just C (the predicted class for instance x). Let θ be a threshold such that if P(C|x) > θ, then x is classified as C, else not C. Compute the fp-rate and tp-rate for different values of θ from 0 to 1. Plot each (fp-rate, tp-rate) point and interpolate (or take the convex hull). If multiple points share the same fp-rate (e.g., from k-fold cross-validation), average their tp-rates.
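A sketch of sweeping the threshold θ over posterior scores to trace the curve; here a decision tree's predict_proba stands in for P(C|x), and scikit-learn's roc_curve does the per-threshold bookkeeping (a manual loop over θ would give the same points):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)          # binary placeholder dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]               # P(C | x) for the positive class

# Each threshold theta yields one (fp-rate, tp-rate) point on the ROC curve.
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC = %.3f" % roc_auc_score(y_te, scores))
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"theta={th:.2f}  fp-rate={f:.2f}  tp-rate={t:.2f}")
```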

What if the classifier does not provide P(C|x), but just C (e.g., a decision tree or rule learner)? Generally, even these discrete classifiers maintain statistics usable for scoring: decision tree leaf nodes store the proportion of training examples of each class, and rules record the number of examples they cover. These statistics can be compared against a varying threshold θ.

[ROC curves for J48 vs. NN on the labor dataset: true positive rate vs. false positive rate.]

We have seen several ways to estimate learning performance: train/test split, cross-validation, ROC, AUC. But how good are these estimates of the true performance? E.g., how close is error_S(h) to error_D(h)?

Estimate the mean μ of a normal distribution N(μ, σ²). Given a sample X = {x^t} of size N, estimate m = Σ_t x^t / N, where m ~ N(μ, σ²/N). Define the statistic Z with a unit normal distribution N(0, 1): √N (m - μ) / σ ~ Z.

95% of Z lies in (-1.96, 1.96); 99% of Z lies in (-2.58, 2.58). P(-1.96 < Z < 1.96) = 0.95 gives a two-sided confidence interval.

P(-1.96 < √N (m - μ)/σ < 1.96) = 0.95
P(m - 1.96 σ/√N < μ < m + 1.96 σ/√N) = 0.95
In general: P(m - z_{α/2} σ/√N < μ < m + z_{α/2} σ/√N) = 1 - α

1 - α:    0.99  0.98  0.95  0.90  0.80  0.68  0.50
z_{α/2}:  2.58  2.33  1.96  1.64  1.28  1.00  0.67
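The tabulated values are just quantiles of the unit normal; a sketch computing them and the resulting two-sided interval with scipy (m, σ, and N below are made-up numbers):

```python
import numpy as np
from scipy.stats import norm

# z_{alpha/2} for the common confidence levels in the table above.
for conf in [0.99, 0.98, 0.95, 0.90, 0.80, 0.68, 0.50]:
    print(f"1-alpha={conf:.2f}  z={norm.ppf(1 - (1 - conf) / 2):.2f}")

# Two-sided 95% interval for mu when sigma is known (illustrative numbers).
m, sigma, N, alpha = 3.0, 0.15, 10, 0.05
half = norm.ppf(1 - alpha / 2) * sigma / np.sqrt(N)
print(f"95% CI for mu: ({m - half:.3f}, {m + half:.3f})")
```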

One-sided:
P(√N (m - μ)/σ < 1.64) = 0.95
P(m - 1.64 σ/√N < μ) = 0.95
In general: P(m - z_α σ/√N < μ) = 1 - α

1 - α: 0.99  0.95  0.90
z_α:   2.33  1.64  1.28

The previous analysis requires that we know σ². We can instead use the sample variance S² = Σ_t (x^t - m)² / (N - 1). When x^t ~ N(μ, σ²), (N - 1) S²/σ² is chi-squared with N - 1 degrees of freedom. Since m and S² are independent, √N (m - μ)/S is t-distributed with N - 1 degrees of freedom.

The t distribution is similar to the normal, but with a larger spread (longer tails), corresponding to the additional uncertainty from using the sample variance.

When σ² is not known: √N (m - μ)/S ~ t_{N-1}, so P(m - t_{α/2, N-1} S/√N < μ < m + t_{α/2, N-1} S/√N) = 1 - α. E.g., t_{0.025,9} = 2.685 and t_{0.025,29} = 2.364 (2-tailed).

Sample: x^t = 3.0, 3.1, 3.2, 2.8, 2.9, 3.1, 3.2, 2.8, 2.9, 3.0 (t = 1, ..., 10).
m = 30/10 = 3.0, S² = 0.2/9 = 0.022, S = 0.149.
α = 0.05, df = N - 1 = 9, t_{0.025,9} = 2.685.
P{3 - 0.127 ≤ μ ≤ 3 + 0.127} = P{2.873 ≤ μ ≤ 3.127} = 0.95.
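A sketch of the same computation with scipy. The critical value comes from scipy's percent-point function with α/2 in each tail; printed t tables index their tail probabilities in more than one way, so treat the exact critical value as convention-dependent:

```python
import numpy as np
from scipy.stats import t

x = np.array([3.0, 3.1, 3.2, 2.8, 2.9, 3.1, 3.2, 2.8, 2.9, 3.0])
N = len(x)
m = x.mean()
S2 = x.var(ddof=1)            # sample variance, divides by N - 1
S = np.sqrt(S2)

alpha = 0.05
t_crit = t.ppf(1 - alpha / 2, df=N - 1)
half = t_crit * S / np.sqrt(N)
print(f"m={m:.3f}  S^2={S2:.4f}  t_crit={t_crit:.3f}")
print(f"95% CI for mu: ({m - half:.3f}, {m + half:.3f})")
```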

We want to claim a hypothesis H_1, e.g., H_1: error_D(h) < 0.10. Define the opposite of H_1 to be the null hypothesis H_0, e.g., H_0: error_D(h) ≥ 0.10. Perform an experiment collecting data about error_D(h). With what probability can we reject H_0?

Example: sample X = {x^t} of size N from N(μ, σ²); estimate the mean m = Σ_t x^t / N. We want to test whether μ equals some constant μ_0. Null hypothesis H_0: μ = μ_0; alternative hypothesis H_1: μ ≠ μ_0. Reject H_0 if m is too far from μ_0.

Example (cont.): we fail to reject H_0 with level of significance α if μ_0 lies in the (1 - α) confidence interval, i.e., if √N (m - μ_0)/σ lies in (-z_{α/2}, z_{α/2}). We reject H_0 if μ_0 falls outside this interval on either side (two-sided test).

Example (cont.): one-sided test, H_0: μ ≤ μ_0 vs. H_1: μ > μ_0. Fail to reject H_0 with level of significance α if √N (m - μ_0)/σ lies in (-∞, z_α); reject H_0 if it falls outside this interval.

Example (cont.): if the variance σ² is unknown, use the sample variance S²; the statistic is then described by the Student's t distribution: √N (m - μ_0)/S ~ t_{N-1}. Fail to reject H_0 with level of significance α if √N (m - μ_0)/S lies in (-∞, t_{α, N-1}); reject H_0 if it falls outside this interval.

Example (cont.): H_0: μ ≤ μ_0 vs. H_1: μ > μ_0 (one-sided), with the same sample x^t = 3.0, 3.1, 3.2, 2.8, 2.9, 3.1, 3.2, 2.8, 2.9, 3.0. μ_0 = 2.9, m = 30/10 = 3.0, S² = 0.2/9 = 0.022, S = 0.149. α = 0.05, df = N - 1 = 9, t_{0.05,9} = 1.833. √N (m - μ_0)/S = 2.121, which falls outside (-∞, 1.833), so reject H_0. Note that t_{0.03145,9} = 2.121, i.e., the p-value is about 0.031.
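The same one-sided test can be run with scipy's one-sample t test; on the ten measurements above it gives a statistic near 2.12 and a p-value near 0.03, consistent with the slide (the alternative argument assumes scipy >= 1.6):

```python
import numpy as np
from scipy.stats import ttest_1samp

x = np.array([3.0, 3.1, 3.2, 2.8, 2.9, 3.1, 3.2, 2.8, 2.9, 3.0])
mu0 = 2.9

# H0: mu <= 2.9  vs  H1: mu > 2.9 (one-sided).
res = ttest_1samp(x, popmean=mu0, alternative='greater')
print(f"t = {res.statistic:.3f}, p = {res.pvalue:.4f}")
# Reject H0 at alpha = 0.05 if p < 0.05.
```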

Learn a classifier on the training set and test it on a test set V of size N. Assume the classifier makes an error with probability p, and let X be the number of errors made by the classifier on V. X is described by a binomial distribution: P{X = j} = (N choose j) p^j (1 - p)^(N-j).

Test the hypothesis H_0: p ≤ p_0 vs. H_1: p > p_0. Reject H_0 with significance α if P{X ≥ e} = Σ_{j=e}^{N} (N choose j) p_0^j (1 - p_0)^(N-j) < α, where e is the number of errors observed on V.

Approximating X with a normal distribution: X is the sum of N independent random variables from the same distribution, so X/N is approximately normal for large N with mean p_0 and variance p_0(1 - p_0)/N (central limit theorem): (X/N - p_0) / √(p_0(1 - p_0)/N) ~ Z. Fail to reject H_0 (p ≤ p_0) with significance α if this statistic falls in (-∞, z_α) (e.g., z_{0.05} = 1.64); reject H_0 otherwise. The approximation works well for Np ≥ 5 and N(1 - p) ≥ 5.

Example: recall the earlier example with error_S(h) = 0.3, N = |S| = 40, X = 12, and error_D(h) = p unknown. Test H_0: p ≤ p_0 vs. H_1: p > p_0 with p_0 = 0.2 and α = 0.05. (X/N - p_0) / √(p_0(1 - p_0)/N) = (0.3 - 0.2) / √(0.2 · 0.8 / 40) = 1.58, which lies in (-∞, 1.64) = (-∞, z_{0.05}), so we fail to reject H_0.
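A sketch of both the exact binomial computation and the normal approximation for this example (N = 40, X = 12, p_0 = 0.2), using scipy:

```python
import numpy as np
from scipy.stats import binom, norm

N, X, p0, alpha = 40, 12, 0.2, 0.05

# Exact binomial test: P{X >= 12 | p = p0}.
p_exact = binom.sf(X - 1, N, p0)        # sf(k) = P{X > k}, so sf(X-1) = P{X >= X}
print(f"exact binomial p-value: {p_exact:.4f}")

# Normal approximation: (X/N - p0) / sqrt(p0 (1 - p0) / N) ~ Z.
z = (X / N - p0) / np.sqrt(p0 * (1 - p0) / N)
z_alpha = norm.ppf(1 - alpha)           # one-sided critical value, ~1.64
print(f"z = {z:.2f}, critical value = {z_alpha:.2f}")
# Here z ~= 1.58 < 1.64, so we fail to reject H0: p <= 0.2.
```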

Example (cont.): what is the 95% (α = 0.05) confidence interval around error_D(h) = p? Let p_0 = error_S(h) = 0.3. P(p_0 - z_{α/2} √(p_0(1 - p_0)/N) < p < p_0 + z_{α/2} √(p_0(1 - p_0)/N)) = 1 - α, so P(0.3 - 1.96 √(0.3 · 0.7 / 40) < p < 0.3 + 1.96 √(0.3 · 0.7 / 40)) = P{0.3 - 0.142 < p < 0.3 + 0.142} = P{0.158 < p < 0.442} = 0.95.
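The corresponding interval as a quick scipy sketch:

```python
import numpy as np
from scipy.stats import norm

p_hat, N, alpha = 0.3, 40, 0.05
half = norm.ppf(1 - alpha / 2) * np.sqrt(p_hat * (1 - p_hat) / N)
print(f"95% CI for error_D(h): ({p_hat - half:.3f}, {p_hat + half:.3f})")
# Matches the slide: roughly (0.158, 0.442).
```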

Evaluate the learner on K training/testing sets, yielding errors p_i, 1 ≤ i ≤ K. Let m = Σ_{i=1}^{K} p_i / K and S² = Σ_{i=1}^{K} (p_i - m)² / (K - 1). Then √K (m - p_0) / S ~ t_{K-1}. Reject H_0 with significance α if this value is greater than t_{α, K-1}. Typically K is 10 or 30 (t_{0.05,9} = 1.83, t_{0.05,29} = 1.70).
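A sketch of this test on K error estimates; the error values and p_0 below are made up:

```python
import numpy as np
from scipy.stats import t

# Illustrative per-fold error estimates from K = 10 train/test runs.
errors = np.array([0.12, 0.15, 0.10, 0.14, 0.11, 0.13, 0.16, 0.12, 0.10, 0.14])
p0, alpha = 0.10, 0.05          # H0: true error <= p0  vs  H1: true error > p0
K = len(errors)

m = errors.mean()
S = errors.std(ddof=1)
stat = np.sqrt(K) * (m - p0) / S
t_crit = t.ppf(1 - alpha, df=K - 1)     # t_{alpha, K-1}, one-sided

print(f"statistic = {stat:.3f}, t_crit = {t_crit:.3f}")
print("reject H0" if stat > t_crit else "fail to reject H0")
```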

K-fold cross-validated paired t test. Paired test: both learners get the same training/testing sets. Use K-fold CV to get K training/testing folds. Let p_i^1 and p_i^2 be the errors of learners 1 and 2 on fold i, and p_i = p_i^1 - p_i^2 the paired difference on fold i. The null hypothesis is that p_i has mean 0: H_0: μ = 0 vs. H_1: μ ≠ 0, with m = Σ_{i=1}^{K} p_i / K and s² = Σ_{i=1}^{K} (p_i - m)² / (K - 1). Then √K · m / s ~ t_{K-1}; accept H_0 at significance α if √K · m / s lies in (-t_{α/2, K-1}, t_{α/2, K-1}).
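A paired-test sketch: both learners are scored on the same K folds and the per-fold differences are tested against mean zero. The two classifiers are illustrative choices, and scikit-learn's cross_val_score supplies the per-fold accuracies:

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)   # same folds for both learners

err1 = 1.0 - cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
err2 = 1.0 - cross_val_score(GaussianNB(), X, y, cv=cv)

# H0: the mean of the paired differences p_i = p_i^1 - p_i^2 is zero.
res = ttest_rel(err1, err2)
print(f"mean difference = {np.mean(err1 - err2):+.3f}, "
      f"t = {res.statistic:.2f}, p = {res.pvalue:.3f}")
```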

Tester:     weka.experiment.PairedCorrectedTTester
Analysing:  Percent_correct
Datasets:   8
Resultsets: 2
Confidence: 0.05 (two tailed)
Sorted by:  -
Date:       10/6/10 12:00 AM

Dataset                   (1) rules.on   (2) bayes
--------------------------------------------------
loan              (100)        39.50        84.50 v
contact-lenses    (100)        72.17        76.17
iris              (100)        93.53        95.53
labor-neg-data    (100)        72.77        93.57 v
segment           (100)        63.33        81.12 v
soybean           (100)        39.75        92.94 v
weather           (100)        36.00        67.50
weather.symbolic  (100)        38.00        57.50
--------------------------------------------------
                             (v/ /*)      (4/4/0)

Be careful when comparing more than two learners. Each comparison has probability α of yielding an incorrect conclusion (incorrectly rejecting the null hypothesis, i.e., incorrectly concluding that learner A is better than learner B). The probability of at least one incorrect conclusion among c comparisons is 1 - (1 - α)^c. One approach to controlling this: analysis of variance (ANOVA).
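A quick numeric illustration of how the chance of at least one incorrect conclusion grows with the number of comparisons c, together with the simple Bonferroni adjustment (ANOVA, mentioned above, is the more principled alternative):

```python
alpha = 0.05
for c in [1, 2, 5, 10, 20]:
    familywise = 1 - (1 - alpha) ** c          # P(at least one false rejection)
    bonferroni = alpha / c                     # per-comparison level keeping ~alpha overall
    print(f"c={c:2d}  P(>=1 incorrect conclusion)={familywise:.3f}  "
          f"Bonferroni per-test alpha={bonferroni:.4f}")
```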

Evaluating a learning algorithm: error of learned hypotheses (and other measures); K-fold and 5x2 cross-validation; ROC curves and AUC; confidence in the error estimate; comparing two learners.