Evaluation
Andrea Passerini, Machine Learning

Estimating the accuracy of a hypothesis

Setting
- Assume a binary classification setting.
- Assume input/output pairs (x, y) are sampled from an unknown probability distribution D = p(x, y).
- Train a binary classifier h (hypothesis) on a sample S_train of pairs drawn independently (of one another) and identically distributed according to D (an i.i.d. sample).
- The task is to estimate the generalization error (also called true error or expected risk) of h over the entire distribution p:

    R[h] = \int_{X \times Y} \delta(y, h(x)) \, p(x, y) \, dx \, dy

Problem
- D is unknown, so R[h] must be approximated with a sample error (or empirical error):

    R_S[h] = \frac{1}{|S|} \sum_{(x,y) \in S} \delta(y, h(x)) = \frac{r}{n}

  where r is the number of errors and n = |S|.
- Bias in the estimate: accuracy on the training sample S_train is an optimistically biased estimate, so an independent i.i.d. test sample S_test is needed.
- Variance in the estimate: the measured accuracy can differ from the true accuracy because of the specificity of the (even unbiased) sample employed, so this variance must be estimated.

Probability distribution of the empirical error estimator

Binomial distribution
- R_S[h] = r/n can be modeled via a random variable measuring the probability of having r positive outcomes (errors) in n trials, with p the true unknown probability of error. Such a random variable obeys a Binomial distribution, with mean np and variance np(1 - p).

Unbiased estimator
- The estimation bias of an estimator Y of a parameter p is E[Y] - p.
- The empirical error is an unbiased estimator of the true error p:

    E[r/n] - p = E[r]/n - p = np/n - p = 0
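As a concrete illustration, the empirical error r/n can be computed directly from predictions. The simulation below is a sketch, not part of the slides: the label-flipping model of h, the chosen p = 0.2, and all names are illustrative assumptions. It draws an i.i.d. test sample on which h errs with probability p and checks that R_S[h] lands close to p, as the unbiasedness argument predicts.

```python
import random

def empirical_error(predictions, labels):
    """R_S[h] = (1/|S|) * sum of delta(y, h(x)) = r/n."""
    errors = sum(1 for y_hat, y in zip(predictions, labels) if y_hat != y)
    return errors / len(labels)

random.seed(0)
p = 0.2   # true (unknown) error probability of hypothesis h (assumed here)
n = 1000  # test sample size
labels = [random.randint(0, 1) for _ in range(n)]
# simulate h: correct with probability 1 - p, wrong with probability p
preds = [y if random.random() > p else 1 - y for y in labels]
print(empirical_error(preds, labels))  # close to p = 0.2
```

Averaging this estimate over many independent test samples would converge to p, which is exactly the E[r/n] = p statement above.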

Probability distribution of the empirical error estimator

Estimator standard deviation
- The standard deviation of the mean of n independently drawn tests of a random variable with standard deviation σ is:

    \sigma_n = \frac{\sigma}{\sqrt{n}} = \sqrt{\frac{p(1-p)}{n}}   (for the Binomial distribution)

- The empirical standard deviation can be approximated by replacing the true error p with the predicted error R_S[h]:

    \sigma_{R_S[h]} \approx \sqrt{\frac{R_S[h](1 - R_S[h])}{n}}

Confidence intervals

Definition
- An N% confidence interval for some parameter p is the interval expected to contain p with probability N%.

Confidence interval for error estimation
- Supply the estimated error R_S[h] with an N% confidence interval (a typical value is 95%).
- The true error R[h] will fall within such an interval with N% probability.

Problem
- It is difficult to compute confidence intervals for the Binomial distribution exactly.

Solution
- For large enough n, the Binomial distribution can be approximated with a Normal distribution with the same mean and variance.
- Alternative rules of thumb for a valid approximation: n ≥ 30, or np(1 - p) ≥ 5.

Computing a 95% confidence interval
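The normal-approximation interval can be sketched in a few lines; the function name and the example values (error 0.2 on n = 1000) are illustrative, and the z value is obtained from the standard normal quantile function rather than a printed table.

```python
from statistics import NormalDist
from math import sqrt

def error_confidence_interval(err, n, confidence=0.95):
    # two-sided: leave (1 - confidence)/2 probability in each tail
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # ~1.96 for 95%
    margin = z * sqrt(err * (1 - err) / n)              # z * sigma_{R_S[h]}
    return err - margin, err + margin

lo, hi = error_confidence_interval(0.2, 1000)
print(round(lo, 3), round(hi, 3))  # 0.175 0.225
```

Note that the rules of thumb above are satisfied here (n ≥ 30 and np(1 - p) = 160 ≥ 5), so the normal approximation is reasonable.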

- Two-sided bound: leave 2.5% probability on each tail of the distribution.
- Recover the value z such that P(x ≥ z) = 2.5% from the N(0, 1) distribution table.
- Multiply z by the standard deviation, obtaining:

    R_S[h] \pm z \sqrt{\frac{R_S[h](1 - R_S[h])}{n}}

Hypothesis testing

Test statistic
- Null hypothesis H_0: the default hypothesis; evidence must be provided in order to reject it.
- Test statistic: given a sample of n realizations of random variables X_1, ..., X_n, a test statistic is a statistic T = h(X_1, ..., X_n) whose value is used to decide whether to reject H_0 or not.

Example
- Given a set of measurements X_1, ..., X_n, decide whether the actual value being measured is zero.
- Null hypothesis: the actual value is zero.
- Test statistic: the sample mean

    T = h(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} X_i

Hypothesis testing

Glossary
- Tail probability: the probability that T is at least as great (right tail) or at least as small (left tail) as the observed value t.
- p-value: the probability of obtaining a value of T at least as extreme as the observed value t, assuming H_0 is true.
- Type I error: rejecting the null hypothesis when it is true.
- Type II error: accepting the null hypothesis when it is false.
- Significance level: the largest acceptable probability of committing a Type I error.
- Critical region: the set of values of T for which we reject the null hypothesis.
- Critical values: the values on the boundary of the critical region.

Hypothesis test example

Setting
- A measuring device takes three speed measurements and returns their average.
- The measurement error can be modelled as a Gaussian N(0, 4).
- The maximum speed limit is 120.
- The acceptable percentage of wrongly fined drivers is 5% (significance level 0.05).

Is a measured speed of 121 significantly over the limit?

Z-test
- Compute the probability that the test statistic T is greater than or equal to 121 when the actual value is 120.
- Standardize the distribution:

    Z = \frac{T - \mu}{\sigma / \sqrt{n}} = \frac{T - 120}{2 / \sqrt{3}}

- Compute the probability from the N(0, 1) distribution table:

    P(T \geq 121) = P\left(\frac{T - 120}{2/\sqrt{3}} \geq \frac{121 - 120}{2/\sqrt{3}}\right) = P(Z \geq 0.87)

- Verify whether the probability is below the significance level:

    P(Z \geq 0.87) = 0.1922 \geq 0.05

  so the measurement is not significantly over the limit: the null hypothesis cannot be rejected.

What is the critical value for the speed limit test?

Procedure
- Compute the minimal value c such that P(T ≥ c) = 0.05:

    P\left(Z \geq \frac{c - 120}{2/\sqrt{3}}\right) = 0.05

- From the distribution table:

    \frac{c - 120}{2/\sqrt{3}} = 1.645

- Therefore:

    c = 120 + 1.645 \cdot \frac{2}{\sqrt{3}} \approx 121.9

t-test

Setting
- Given a set of values of i.i.d. random variables X_1, ..., X_n, decide whether to reject the null hypothesis that their distribution has mean µ_0.

Problem
- The variance of the distribution is unknown.

Solution
- Replace the unknown variance with an unbiased estimate:

    S_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2
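The speed-limit Z-test can be reproduced numerically; this sketch uses the standard normal CDF in place of the printed distribution table, with σ² = 4 per measurement, n = 3 averaged readings, and H_0: true speed = 120 taken from the slide.

```python
from statistics import NormalDist
from math import sqrt

mu0, sigma, n, observed = 120.0, 2.0, 3, 121.0
z = (observed - mu0) / (sigma / sqrt(n))  # standardized statistic, ~0.87
p_value = 1 - NormalDist().cdf(z)         # right-tail probability
print(round(z, 2), round(p_value, 2))     # 0.87 0.19 -> not significant

# critical value c with P(T >= c) = 0.05 under H0
c = mu0 + NormalDist().inv_cdf(0.95) * sigma / sqrt(n)
print(round(c, 1))                        # 121.9
```

Any measured average below c ≈ 121.9 is consistent with the limit at the 0.05 significance level, matching the table-based derivation.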

t-test

The test
- The test statistic is given by the standardized (also called studentized) mean:

    T = \frac{\bar{X}_n - \mu_0}{S_n / \sqrt{n}}

- Assuming the data come from an unknown Normal distribution, the test statistic has a t_{n-1} distribution under the null hypothesis.
- The null hypothesis can be rejected at significance level α if:

    T \leq -t_{n-1,\alpha/2} \quad \text{or} \quad T \geq t_{n-1,\alpha/2}
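A minimal runnable version of the one-sample t-test follows. The data values, the choice µ_0 = 120, and the hard-coded critical value t_{9,0.025} = 2.262 (read from the t-table at the end of these notes) are illustrative assumptions.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(xs, mu0):
    # standardized (studentized) mean; stdev already divides by n - 1,
    # so it is the unbiased estimate S_n
    n = len(xs)
    return (mean(xs) - mu0) / (stdev(xs) / sqrt(n))

xs = [119.5, 121.2, 120.8, 122.0, 121.5, 120.1, 121.9, 120.6, 121.3, 121.1]
t = t_statistic(xs, 120.0)
reject = abs(t) >= 2.262  # t_{9, 0.025}: two-tailed test at alpha = 0.05
print(round(t, 2), reject)
```

For these sample values the statistic is well past the critical value, so the null hypothesis of mean 120 would be rejected.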


The t_{n-1} distribution
- A bell-shaped distribution similar to the Normal one.
- Wider and shorter: this reflects the greater variance due to using the estimate S_n^2 instead of the true unknown variance of the distribution.
- n - 1 is the number of degrees of freedom of the distribution (related to the number of independent events observed).
- t_{n-1} tends to the standardized normal z as n → ∞.

Comparing learning algorithms

Task
- Given two learning algorithms L_A and L_B, we want to estimate whether they are significantly different in learning a function f, i.e. whether

    E_{S \sim D}[R[L_A(S)] - R[L_B(S)]]

  that is, the expected difference between their generalization errors when training on a random sample, is significantly different from zero.

Problem
- D is unknown: we must rely on a sample D_0.

Solution: k-fold cross validation
- Split D_0 into k equal-sized disjoint subsets T_i.
- For i ∈ [1, k]:
  - train L_A and L_B on S_i = D_0 \ T_i
  - test L_A and L_B on T_i
  - compute δ_i = R_{T_i}[L_A(S_i)] - R_{T_i}[L_B(S_i)]
- Return

    \bar{\delta} = \frac{1}{k} \sum_{i=1}^{k} \delta_i

Comparing learning algorithms: t-test
- Null hypothesis: the mean error difference is zero.
- t-test at significance level α: reject the null hypothesis if

    \frac{\bar{\delta}}{S_{\bar{\delta}}} \leq -t_{k-1,\alpha/2} \quad \text{or} \quad \frac{\bar{\delta}}{S_{\bar{\delta}}} \geq t_{k-1,\alpha/2}

  where:

    S_{\bar{\delta}} = \sqrt{\frac{1}{k(k-1)} \sum_{i=1}^{k} (\delta_i - \bar{\delta})^2}

Note
- Paired test: the two hypotheses were evaluated over identical samples.
- Two-tailed test: used if no prior knowledge can tell the direction of the difference (otherwise use a one-tailed test).
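The k-fold comparison loop can be sketched as below. The two "learners" (a majority-class predictor vs. a constant predictor), the fold-splitting scheme, and the toy dataset are placeholders standing in for L_A, L_B, and D_0; any two training procedures returning a predictor would slot into the same loop.

```python
def majority_learner(train):
    # placeholder L_A: predict the majority class of the training set
    ones = sum(y for _, y in train)
    label = 1 if ones >= len(train) - ones else 0
    return lambda x: label

def constant_learner(train):
    # placeholder L_B: always predict class 1
    return lambda x: 1

def error_rate(h, test):
    return sum(1 for x, y in test if h(x) != y) / len(test)

def k_fold_deltas(data, k, learn_a, learn_b):
    folds = [data[i::k] for i in range(k)]  # k disjoint test sets T_i
    deltas = []
    for i in range(k):
        test = folds[i]
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        # delta_i = R_{T_i}[L_A(S_i)] - R_{T_i}[L_B(S_i)]
        deltas.append(error_rate(learn_a(train), test)
                      - error_rate(learn_b(train), test))
    return deltas

data = [(x, 1 if x % 3 == 0 else 0) for x in range(60)]  # toy dataset
deltas = k_fold_deltas(data, 10, majority_learner, constant_learner)
print(len(deltas))  # 10: one error difference per fold
```

The returned δ_i values feed directly into the paired t-test above via their mean and standard-deviation estimate.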

t-test example

10-fold cross validation
- Test errors:

    T_i     R_{T_i}[L_A(S_i)]   R_{T_i}[L_B(S_i)]   δ_i
    T_1     0.81                0.80                 0.01
    T_2     0.82                0.77                 0.05
    T_3     0.84                0.70                 0.14
    T_4     0.78                0.83                -0.05
    T_5     0.85                0.80                 0.05
    T_6     0.86                0.78                 0.08
    T_7     0.82                0.75                 0.07
    T_8     0.83                0.80                 0.03
    T_9     0.82                0.78                 0.04
    T_10    0.81                0.77                 0.04

- Average error difference:

    \bar{\delta} = \frac{1}{10} \sum_{i=1}^{10} \delta_i = 0.046

- Unbiased estimate of the standard deviation:

    S_{\bar{\delta}} = \sqrt{\frac{1}{10 \cdot 9} \sum_{i=1}^{10} (\delta_i - \bar{\delta})^2} = 0.0154344

- Standardized mean error difference:

    \frac{\bar{\delta}}{S_{\bar{\delta}}} = \frac{0.046}{0.0154344} = 2.98

- t distribution for α = 0.05 and k = 10:

    t_{k-1,\alpha/2} = t_{9,0.025} = 2.262 < 2.98

- The null hypothesis is rejected: the classifiers are significantly different.
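The example's numbers can be recomputed directly from the δ_i column; only the critical value t_{9,0.025} = 2.262 is taken from the table rather than computed.

```python
from math import sqrt

# per-fold error differences delta_i from the 10-fold table above
deltas = [0.01, 0.05, 0.14, -0.05, 0.05, 0.08, 0.07, 0.03, 0.04, 0.04]
k = len(deltas)

mean_delta = sum(deltas) / k                                   # 0.046
s_delta = sqrt(sum((d - mean_delta) ** 2 for d in deltas) / (k * (k - 1)))
t = mean_delta / s_delta                                       # ~2.98
print(round(mean_delta, 3), round(s_delta, 5), round(t, 2))

print(t > 2.262)  # True: exceeds t_{9, 0.025}, reject the null hypothesis
```

This confirms the slide's arithmetic: the standardized difference 2.98 exceeds the two-tailed critical value, so the two learners differ significantly at α = 0.05.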

t-distribution table

The shaded area is equal to α for t = t_α.

    df    t.100   t.050   t.025   t.010   t.005
    1     3.078   6.314   12.706  31.821  63.657
    2     1.886   2.920   4.303   6.965   9.925
    3     1.638   2.353   3.182   4.541   5.841
    4     1.533   2.132   2.776   3.747   4.604
    5     1.476   2.015   2.571   3.365   4.032
    6     1.440   1.943   2.447   3.143   3.707
    7     1.415   1.895   2.365   2.998   3.499
    8     1.397   1.860   2.306   2.896   3.355
    9     1.383   1.833   2.262   2.821   3.250
    10    1.372   1.812   2.228   2.764   3.169
    11    1.363   1.796   2.201   2.718   3.106
    12    1.356   1.782   2.179   2.681   3.055
    13    1.350   1.771   2.160   2.650   3.012
    14    1.345   1.761   2.145   2.624   2.977
    15    1.341   1.753   2.131   2.602   2.947
    16    1.337   1.746   2.120   2.583   2.921
    17    1.333   1.740   2.110   2.567   2.898
    18    1.330   1.734   2.101   2.552   2.878
    19    1.328   1.729   2.093   2.539   2.861
    20    1.325   1.725   2.086   2.528   2.845
    21    1.323   1.721   2.080   2.518   2.831
    22    1.321   1.717   2.074   2.508   2.819
    23    1.319   1.714   2.069   2.500   2.807
    24    1.318   1.711   2.064   2.492   2.797
    25    1.316   1.708   2.060   2.485   2.787
    26    1.315   1.706   2.056   2.479   2.779
    27    1.314   1.703   2.052   2.473   2.771
    28    1.313   1.701   2.048   2.467   2.763
    29    1.311   1.699   2.045   2.462   2.756
    30    1.310   1.697   2.042   2.457   2.750
    32    1.309   1.694   2.037   2.449   2.738
    34    1.307   1.691   2.032   2.441   2.728
    36    1.306   1.688   2.028   2.434   2.719
    38    1.304   1.686   2.024   2.429   2.712
    ∞     1.282   1.645   1.960   2.326   2.576

Table by Gilles Cazelais, typeset with LaTeX on April 20, 2006.