Evaluating Hypotheses

[Read Ch. 5] [Recommended exercises: 5.2, 5.3, 5.4]


Evaluating Hypotheses (IEEE Expert, October 1996)

Evaluating Hypotheses

- Sample error, true error
- Confidence intervals for observed hypothesis error
- Estimators
- Binomial distribution, Normal distribution, Central Limit Theorem
- Paired t tests
- Comparing learning methods

Evaluating Hypotheses and Learners

Consider hypotheses H1 and H2 learned by learners L1 and L2.
- How can we learn a hypothesis H and estimate its accuracy with limited data?
- How well does the observed accuracy of H over a limited sample estimate its accuracy over unseen data?
- If H1 outperforms H2 on the sample, will H1 outperform H2 in general? Can we draw the same conclusion about L1 and L2?

Two Definitions of Error

The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:

$$\text{error}_D(h) \equiv \Pr_{x \sim D}[f(x) \neq h(x)]$$

The sample error of h with respect to target function f and data sample S (of size n) is the proportion of examples h misclassifies:

$$\text{error}_S(h) \equiv \frac{1}{n} \sum_{x \in S} \delta(f(x) \neq h(x))$$

where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.

How well does error_S(h) estimate error_D(h)?
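In code, the sample error is just the misclassification rate over S. A minimal sketch in Python (the callables h and f and the NumPy dependency are assumptions for illustration, not from the slides):

```python
import numpy as np

def sample_error(h, f, S):
    """error_S(h): fraction of examples in S that h misclassifies,
    where h and f each map an instance to a label."""
    return float(np.mean([h(x) != f(x) for x in S]))

# Toy usage: a constant hypothesis against a threshold target on S = {0..9}.
print(sample_error(lambda x: 1, lambda x: int(x >= 5), range(10)))  # 0.5
```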

Problems Estimating Error

1. Bias: If S is the training set, error_S(h) is optimistically biased:
   bias ≡ E[error_S(h)] − error_D(h)
   For an unbiased estimate, h and S must be chosen independently.
2. Variance: Even with an unbiased S, error_S(h) may still vary from error_D(h).

Example

Hypothesis h misclassifies 12 of the 40 examples in S:
error_S(h) = 12/40 = 0.30
What is error_D(h)?

Estimators

Experiment:
1. Choose sample S of size n according to distribution D.
2. Measure error_S(h).

error_S(h) is a random variable (i.e., the result of an experiment).
error_S(h) is an unbiased estimator for error_D(h).
Given observed error_S(h), what can we conclude about error_D(h)?

Confidence Intervals

If S contains n ≥ 30 examples, drawn independently of h and of each other, then with approximately 95% probability, error_D(h) lies in the interval

$$\text{error}_S(h) \pm 1.96 \sqrt{\frac{\text{error}_S(h)(1 - \text{error}_S(h))}{n}}$$

Confidence Intervals

If S contains n ≥ 30 examples, drawn independently of h and of each other, then with approximately N% probability, error_D(h) lies in the interval

$$\text{error}_S(h) \pm z_N \sqrt{\frac{\text{error}_S(h)(1 - \text{error}_S(h))}{n}}$$

where

N%:  50%   68%   80%   90%   95%   98%   99%
z_N: 0.67  1.00  1.28  1.64  1.96  2.33  2.58
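This interval is easy to compute directly; a minimal sketch, assuming SciPy is available (norm.ppf recovers z_N from the confidence level instead of the table):

```python
from math import sqrt
from scipy.stats import norm

def error_confidence_interval(r, n, confidence=0.95):
    """Approximate two-sided CI for error_D(h), given r errors on n test examples."""
    e = r / n                               # sample error, error_S(h)
    z = norm.ppf(0.5 + confidence / 2)      # z_N, e.g. 1.96 for 95%
    half = z * sqrt(e * (1 - e) / n)
    return e - half, e + half

# The running example: 12 misclassified out of 40.
print(error_confidence_interval(12, 40))    # approx. (0.16, 0.44)
```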

error_S(h) is a Random Variable

Rerun the experiment with different randomly drawn S (of size n). The probability of observing r misclassified examples is

$$P(r) = \frac{n!}{r!(n-r)!} \, \text{error}_D(h)^r \, (1 - \text{error}_D(h))^{n-r}$$

[Figure: Binomial distribution for n = 40, p = 0.3]

Binomial Probability Distribution

[Figure: Binomial distribution for n = 40, p = 0.3]

$$P(r) = \frac{n!}{r!(n-r)!} \, p^r (1-p)^{n-r}$$

Probability P(r) of r heads in n coin flips, if p = Pr(heads).
- Expected, or mean, value of X: $E[X] \equiv \sum_{i=0}^{n} i\,P(i) = np$
- Variance of X: $Var(X) \equiv E[(X - E[X])^2] = np(1-p)$
- Standard deviation of X: $\sigma_X \equiv \sqrt{E[(X - E[X])^2]} = \sqrt{np(1-p)}$
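A quick numeric check of these formulas for the slide's example (n = 40, p = 0.3), assuming SciPy's binom is available:

```python
from scipy.stats import binom

n, p = 40, 0.3
print(binom.pmf(12, n, p))        # P(r = 12), the most probable count here (~0.14)
print(n * p, n * p * (1 - p))     # mean np = 12.0 and variance np(1-p) = 8.4
```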

Normal Distribution Approximates Binomial

error_S(h) follows a Binomial distribution, with
- mean $\mu_{\text{error}_S(h)} = \text{error}_D(h)$
- standard deviation $\sigma_{\text{error}_S(h)} = \sqrt{\frac{\text{error}_D(h)(1-\text{error}_D(h))}{n}}$

Approximate this by a Normal distribution with
- mean $\mu_{\text{error}_S(h)} = \text{error}_D(h)$
- standard deviation $\sigma_{\text{error}_S(h)} \approx \sqrt{\frac{\text{error}_S(h)(1-\text{error}_S(h))}{n}}$

Normal Probability Distribution

[Figure: Normal distribution with mean 0, standard deviation 1]

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

The probability that X falls into the interval (a, b) is given by $\int_a^b p(x)\,dx$.
- Expected, or mean, value of X: E[X] = µ
- Variance of X: Var(X) = σ²
- Standard deviation of X: σ_X = σ

Normal Probability Distribution

[Figure: standard Normal density with shaded central region]

80% of the area (probability) lies in µ ± 1.28σ.
N% of the area (probability) lies in µ ± z_N σ:

N%:  50%   68%   80%   90%   95%   98%   99%
z_N: 0.67  1.00  1.28  1.64  1.96  2.33  2.58

Confidence Intervals, More Correctly

If S contains n ≥ 30 examples, drawn independently of h and of each other, then with approximately 95% probability, error_S(h) lies in the interval

$$\text{error}_D(h) \pm 1.96 \sqrt{\frac{\text{error}_D(h)(1-\text{error}_D(h))}{n}}$$

Equivalently, error_D(h) lies in the interval

$$\text{error}_S(h) \pm 1.96 \sqrt{\frac{\text{error}_D(h)(1-\text{error}_D(h))}{n}}$$

which is approximately

$$\text{error}_S(h) \pm 1.96 \sqrt{\frac{\text{error}_S(h)(1-\text{error}_S(h))}{n}}$$

Two-Sided and One-Sided Bounds

[Figures: standard Normal density with two-sided vs. one-sided rejection regions]

If µ − z_N σ ≤ y ≤ µ + z_N σ with confidence N = 100(1 − α)%, then
- y ≤ µ + z_N σ with confidence N = 100(1 − α/2)%, and
- y ≥ µ − z_N σ with confidence N = 100(1 − α/2)%.

Example: n = 40, r = 12.
- Two-sided, 95% confidence (α = 0.05): P(0.16 ≤ y ≤ 0.44) = 0.95
- One-sided: P(y ≤ 0.44) = P(y ≥ 0.16) = 1 − α/2 = 0.975
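The example's numbers can be reproduced directly; a small sketch, assuming SciPy (norm.ppf supplies z_N):

```python
from scipy.stats import norm

e, n = 12 / 40, 40                    # running example: r = 12 errors on n = 40
sd = (e * (1 - e) / n) ** 0.5
z = norm.ppf(0.975)                   # z_N for a two-sided 95% interval
lo, hi = e - z * sd, e + z * sd
print(round(lo, 2), round(hi, 2))     # two-sided: P(0.16 <= y <= 0.44) = 0.95
print(f"one-sided: P(y <= {hi:.2f}) = P(y >= {lo:.2f}) = 0.975")
```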

Calculating Confidence Intervals

1. Pick the parameter p to estimate: error_D(h).
2. Choose an estimator: error_S(h).
3. Determine the probability distribution that governs the estimator: error_S(h) is governed by a Binomial distribution, approximated by a Normal distribution when n ≥ 30.
4. Find the interval (L, U) such that N% of the probability mass falls in the interval: use the table of z_N values.

Central Limit Theorem

Consider a set of independent, identically distributed random variables Y_1, ..., Y_n, all governed by an arbitrary probability distribution with mean µ and finite variance σ². Define the sample mean

$$\bar{Y} \equiv \frac{1}{n} \sum_{i=1}^{n} Y_i$$

Central Limit Theorem: As n → ∞, the distribution governing $\bar{Y}$ approaches a Normal distribution with mean µ and variance σ²/n.
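The theorem is easy to see empirically. A minimal simulation sketch (NumPy and the Uniform(0, 1) choice are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# 10,000 sample means, each over n Uniform(0, 1) draws (mu = 0.5, sigma^2 = 1/12).
means = rng.uniform(0.0, 1.0, size=(10_000, n)).mean(axis=1)

# The CLT predicts the means are approximately Normal(mu, sigma^2 / n).
print(means.mean(), means.var())   # ~0.5 and ~1/(12 * 50) = 0.00167
```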

Difference Between Hypotheses

Test h_1 on sample S_1, test h_2 on S_2.
1. Pick the parameter to estimate: $d \equiv \text{error}_D(h_1) - \text{error}_D(h_2)$
2. Choose an estimator: $\hat{d} \equiv \text{error}_{S_1}(h_1) - \text{error}_{S_2}(h_2)$
3. Determine the probability distribution that governs the estimator:

$$\sigma_{\hat{d}} \approx \sqrt{\frac{\text{error}_{S_1}(h_1)(1-\text{error}_{S_1}(h_1))}{n_1} + \frac{\text{error}_{S_2}(h_2)(1-\text{error}_{S_2}(h_2))}{n_2}}$$

4. Find the interval (L, U) such that N% of the probability mass falls in the interval:

$$\hat{d} \pm z_N \sqrt{\frac{\text{error}_{S_1}(h_1)(1-\text{error}_{S_1}(h_1))}{n_1} + \frac{\text{error}_{S_2}(h_2)(1-\text{error}_{S_2}(h_2))}{n_2}}$$

Hypothesis Testing Example

P(error_D(h_1) > error_D(h_2)) = ?

Given |S_1| = |S_2| = 100, error_{S_1}(h_1) = 0.30, and error_{S_2}(h_2) = 0.20:
$\hat{d} = 0.10$, $\sigma_{\hat{d}} = 0.061$

$P(\hat{d} < \mu_{\hat{d}} + 0.10)$ is the probability that $\hat{d}$ does not overestimate d by more than 0.10. Setting $z_N \sigma_{\hat{d}} = 0.10$ gives $z_N = 1.64$, so

$$P(\hat{d} < \mu_{\hat{d}} + 1.64\,\sigma_{\hat{d}}) = 0.95$$

I.e., we reject the null hypothesis at the 0.05 level of significance.
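A sketch that reproduces these numbers, assuming SciPy (norm.cdf turns the z-score into the confidence level):

```python
from math import sqrt
from scipy.stats import norm

e1, n1 = 0.30, 100   # error_S1(h1) on |S1| = 100
e2, n2 = 0.20, 100   # error_S2(h2) on |S2| = 100

d_hat = e1 - e2                                          # estimate of d
sigma_d = sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)  # approx. sigma of d_hat

z = d_hat / sigma_d
print(sigma_d, z, norm.cdf(z))   # ~0.061, ~1.64, ~0.95 -> reject H0 at 0.05
```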

Paired t Test to Compare h_A, h_B

1. Partition the data into k disjoint test sets T_1, T_2, ..., T_k of equal size, where this size is at least 30.
2. For i from 1 to k: $\delta_i \leftarrow \text{error}_{T_i}(h_A) - \text{error}_{T_i}(h_B)$
3. Return the value $\bar{\delta} \equiv \frac{1}{k} \sum_{i=1}^{k} \delta_i$

N% confidence interval estimate for d:

$$\bar{\delta} \pm t_{N,k-1}\, s_{\bar{\delta}}, \qquad s_{\bar{\delta}} \equiv \sqrt{\frac{1}{k(k-1)} \sum_{i=1}^{k} (\delta_i - \bar{\delta})^2}$$

Note: the δ_i are approximately Normally distributed.
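A minimal sketch of this procedure, assuming NumPy/SciPy and that the per-test-set error rates of each hypothesis have already been collected into two lists:

```python
import numpy as np
from scipy import stats

def paired_t_interval(errors_a, errors_b, confidence=0.95):
    """CI for the mean error difference over k paired test sets T_1..T_k."""
    deltas = np.asarray(errors_a) - np.asarray(errors_b)  # delta_i per test set
    k = len(deltas)
    d_bar = deltas.mean()
    s = deltas.std(ddof=1) / np.sqrt(k)                   # s_delta-bar from the slide
    t = stats.t.ppf(0.5 + confidence / 2, df=k - 1)       # t_{N, k-1}
    return d_bar - t * s, d_bar + t * s
```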

Comparing Learning Algorithms L_A and L_B

What we'd like to estimate:

$$E_{S \subset D}[\text{error}_D(L_A(S)) - \text{error}_D(L_B(S))]$$

where L(S) is the hypothesis output by learner L using training set S; i.e., the expected difference in true error between the hypotheses output by learners L_A and L_B, when trained using randomly selected training sets S drawn according to distribution D.

But given limited data D_0, what is a good estimator?
- We could partition D_0 into training set S_0 and test set T_0, and measure $\text{error}_{T_0}(L_A(S_0)) - \text{error}_{T_0}(L_B(S_0))$.
- Even better: repeat this many times and average the results (next slide).

Comparing Learning Algorithms L_A and L_B

1. Partition data D_0 into k disjoint test sets T_1, T_2, ..., T_k of equal size, where this size is at least 30.
2. For i from 1 to k, use T_i for the test set and the remaining data for training set S_i:
   - $S_i \leftarrow \{D_0 - T_i\}$
   - $h_A \leftarrow L_A(S_i)$
   - $h_B \leftarrow L_B(S_i)$
   - $\delta_i \leftarrow \text{error}_{T_i}(h_A) - \text{error}_{T_i}(h_B)$
3. Return the value $\bar{\delta} \equiv \frac{1}{k} \sum_{i=1}^{k} \delta_i$
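A sketch of this k-fold comparison, assuming scikit-learn's KFold for the partitioning, NumPy arrays X and y, and estimator factories with fit/predict (all names here are illustrative, not from the slides):

```python
import numpy as np
from sklearn.model_selection import KFold

def compare_learners(make_a, make_b, X, y, k=10, seed=0):
    """Return delta-bar and the per-fold error differences delta_i."""
    deltas = []
    for train, test in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        h_a = make_a().fit(X[train], y[train])   # h_A <- L_A(S_i)
        h_b = make_b().fit(X[train], y[train])   # h_B <- L_B(S_i)
        err_a = np.mean(h_a.predict(X[test]) != y[test])
        err_b = np.mean(h_b.predict(X[test]) != y[test])
        deltas.append(err_a - err_b)             # delta_i
    return float(np.mean(deltas)), deltas
```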

Comparing Learning Algorithms L_A and L_B

Notice we'd like to use the paired t test on $\bar{\delta}$ to obtain a confidence interval, but this is not really correct, because the training sets in this algorithm are not independent (they overlap!).

It is more correct to view the algorithm as producing an estimate of

$$E_{S \subset D_0}[\text{error}_D(L_A(S)) - \text{error}_D(L_B(S))]$$

instead of

$$E_{S \subset D}[\text{error}_D(L_A(S)) - \text{error}_D(L_B(S))]$$

But even this approximation is better than no comparison.

More on t-Tests

Good for comparing two learners, but not for multiple pairs: the probability of falsely rejecting at least one null hypothesis grows rapidly with the number of pairs. (Null hypothesis: the learners perform equally.)

$$\alpha_e = 1 - (1 - \alpha_c)^m, \qquad m = k(k-1)/2$$

where
- α_e = P(incorrectly rejecting "all means are equal")
- α_c = P(incorrectly rejecting the null hypothesis for one pair)
- k = number of learners
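A one-liner makes the inflation concrete; for k = 5 learners tested pairwise at α_c = 0.05 (illustrative numbers, not from the slides):

```python
k, alpha_c = 5, 0.05
m = k * (k - 1) // 2                # number of pairwise comparisons: 10
alpha_e = 1 - (1 - alpha_c) ** m
print(m, round(alpha_e, 3))         # 10 0.401 -> ~40% chance of a false rejection
```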

Analysis of Variance (ANOVA)

Estimates P(the means are not equal). MS = Mean Square.

$$MS_{between} = \frac{1}{df_b} \sum_j (\bar{x}_j - \bar{x})^2 \qquad MS_{within} = \frac{1}{df_w} \sum_j \sum_k (x_{jk} - \bar{x}_j)^2$$

where
- j = number of groups (learners)
- k = number of trials per group
- $\bar{x}_j$ = mean of the trials for group j (i.e., mean error)
- $\bar{x}$ = mean of all trials for all groups
- df_b = degrees of freedom between groups = j − 1
- df_w = degrees of freedom within all groups = j(k − 1)

If there is no difference between the groups, then MS_within ≈ MS_between.

$$\text{ANOVA:} \quad F = \frac{MS_{between}}{MS_{within}}$$

The statistic (F, df_b, df_w) maps to the probability of rejecting the null hypothesis; increased F leads to decreased P(the means are equal).
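A sketch of a one-way ANOVA across learners, assuming SciPy's f_oneway and hypothetical per-trial error rates:

```python
from scipy.stats import f_oneway

# Hypothetical per-trial error rates for three learners, five trials each.
errors_a = [0.12, 0.15, 0.11, 0.14, 0.13]
errors_b = [0.18, 0.20, 0.17, 0.19, 0.21]
errors_c = [0.13, 0.12, 0.16, 0.14, 0.15]

F, p = f_oneway(errors_a, errors_b, errors_c)
print(F, p)   # large F / small p -> reject "all learners perform equally"
```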

Summary

Evaluating hypotheses:
- Sample error vs. true error
- Confidence intervals for observed hypothesis error
- Error estimators
- Binomial distribution, Normal distribution, Central Limit Theorem

Comparing hypotheses and learners:
- Hypothesis testing
- Paired t tests
- Cross validation
- Analysis of variance