Estimating the accuracy of a hypothesis

Setting
- Assume a binary classification setting.
- Assume input/output pairs (x, y) are sampled from an unknown probability distribution D = p(x, y).
- Train a binary classifier h (hypothesis) on a sample S_train of pairs drawn independently (of one another) and identically distributed according to D (an i.i.d. sample).
- The task is to estimate the generalization error (also called true error or expected risk) of h over the entire distribution p:

  R[h] = ∫_{X×Y} δ(y, h(x)) p(x, y) dx dy

Estimating the accuracy of a hypothesis

Problem
- D is unknown: we need to approximate R[h] with a sample error (or empirical error):

  R_S[h] = (1/|S|) Σ_{(x,y)∈S} δ(y, h(x)) = r/n

  where r is the number of errors and n = |S|.
- Bias in the estimate: accuracy on the training sample S_train is an optimistically biased estimate; we need an independent i.i.d. test sample S_test.
- Variance in the estimate: the measured accuracy can differ from the true accuracy because of the specificity of the (even unbiased) sample employed; we need to estimate this variance.

Probability distribution of the empirical error estimator

Binomial distribution
- R_S[h] = r/n can be modeled as a random variable measuring the probability of observing r positive outcomes (errors) in n trials, with p the true unknown probability of error.
- Such a random variable obeys a Binomial distribution: E[X] = np, Var[X] = np(1 − p).

Unbiased estimator
- The estimation bias of an estimator Y of a parameter p is E[Y] − p.
- The empirical error is an unbiased estimator of the true error p:

  E[r/n] − p = E[r]/n − p = np/n − p = 0
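A minimal simulation can illustrate the unbiasedness claim: each test example is misclassified with probability p, so the error count r is Binomial(n, p) and the average of r/n over many test samples converges to p. The values p = 0.2 and n = 50 below are illustrative, not from the text.

```python
import random

def empirical_error(p, n, rng):
    """Draw n i.i.d. test outcomes and return the error fraction r/n."""
    r = sum(1 for _ in range(n) if rng.random() < p)
    return r / n

rng = random.Random(0)
p, n = 0.2, 50  # illustrative true error and test-set size
estimates = [empirical_error(p, n, rng) for _ in range(20000)]
mean_estimate = sum(estimates) / len(estimates)
# E[r/n] = p, so the average over many samples should be close to 0.2
print(round(mean_estimate, 3))
```

Any single estimate r/n still fluctuates around p; only its expectation matches, which is why the variance of the estimator is studied next.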
Probability distribution of the empirical error estimator

Estimator standard deviation
- The standard deviation of the mean of n independently drawn tests of a random variable with standard deviation σ is:

  σ_n = σ/√n = √(p(1 − p)/n)   (for the Binomial distribution)

- The empirical standard deviation can be approximated by replacing the true error p with the predicted error R_S[h]:

  σ_{R_S[h]} ≈ √(R_S[h](1 − R_S[h])/n)

Confidence intervals

Definition
- An N% confidence interval for some parameter p is the interval expected to contain p with probability N%.

Confidence interval for error estimation
- Supply the estimated error R_S[h] with an N% confidence interval (a typical value is 95%).
- The true error R[h] will fall within this interval with N% probability.

Problem
- It is difficult to compute confidence intervals for the Binomial distribution.

Solution
- For large enough n, the Binomial distribution can be approximated by a Normal distribution with the same mean and variance.
- Alternative rules of thumb for a valid approximation: n ≥ 30, or np(1 − p) ≥ 5.

Computing a 95% confidence interval
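The approximation above can be sketched in a few lines: estimate the error, plug it into the standard-deviation formula, and widen by the Normal quantile z. The counts (r = 12 errors out of n = 100 tests) are illustrative, not from the text.

```python
import math

def error_confidence_interval(r, n, z=1.96):
    """Return (estimate, lower, upper) for the true error at ~95% level,
    using the Normal approximation to the Binomial."""
    err = r / n
    sigma = math.sqrt(err * (1 - err) / n)  # approximated std. deviation
    return err, err - z * sigma, err + z * sigma

# n * err * (1 - err) = 10.56 >= 5, so the Normal approximation is acceptable
err, lo, hi = error_confidence_interval(12, 100)
print(round(err, 3), round(lo, 3), round(hi, 3))
```

The value z = 1.96 is the two-sided 95% quantile of N(0, 1), which is derived next.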
Two-sided bound
- 2.5% left-out probability on each tail of the distribution.
- Recover the value z such that P(x ≥ z) = 2.5% from the N(0, 1) distribution table.
- Multiply z by the standard deviation, obtaining:

  R_S[h] ± z √(R_S[h](1 − R_S[h])/n)

Hypothesis testing

Test statistic
- null hypothesis H_0: the default hypothesis, for rejecting which evidence should be provided.
- test statistic: given a sample of n realizations of random variables X_1, ..., X_n, a test statistic is a statistic T = h(X_1, ..., X_n) whose value is used to decide whether to reject H_0 or not.

Example
- Given a set of measurements X_1, ..., X_n, decide whether the actual value to be measured is zero.
- null hypothesis: the actual value is zero.
- test statistic: the sample mean T = h(X_1, ..., X_n) = (1/n) Σ_{i=1}^{n} X_i

Hypothesis testing

Glossary
- tail probability: the probability that T is at least as great (right tail) or at least as small (left tail) as the observed value t.
- p-value: the probability of obtaining a value of T at least as extreme as the observed one t, in case H_0 is true.
- Type I error: rejecting the null hypothesis when it is true.
- Type II error: accepting the null hypothesis when it is false.
- significance level: the largest acceptable probability of committing a Type I error.
- critical region: the set of values of T for which we reject the null hypothesis.
- critical values: the values on the boundary of the critical region.

Hypothesis test example

Setting
- A measuring device takes three speed measurements and returns their average.
- The measurement error can be modelled as a Gaussian N(0, 4).
- The maximum speed limit is 120.
- The acceptable percentage of wrongly fined drivers is 5% (significance level 0.05).
Is a measured speed of 121 significantly over the limit?

Z-test
- Compute the probability that the test statistic T is greater than or equal to 121 in case the actual value is 120.
- Standardize the distribution:

  Z = (T − µ)/√(σ²/n) = (T − 120)/(2/√3)

- Compute the probability from the N(0, 1) distribution table:

  P(T ≥ 121) = P((T − 120)/(2/√3) ≥ (121 − 120)/(2/√3)) = P(Z ≥ 0.87)

- Compare the probability with the significance level:

  P(Z ≥ 0.87) = 0.1922 ≥ 0.05

  so the null hypothesis cannot be rejected: 121 is not significantly over the limit.

What is the critical value for the speed limit test?

Procedure
- Compute the minimal value c such that P(T ≥ c) = 0.05:

  P(Z ≥ (c − 120)/(2/√3)) = 0.05
  (c − 120)/(2/√3) = 1.645   (from the distribution table)
  c = 120 + 1.645 · 2/√3 ≈ 121.9

t-test

Setting
- Given a set of values for i.i.d. random variables X_1, ..., X_n, decide whether to reject the null hypothesis that their distribution has mean µ_0.

Problem
- The variance of the distribution is unknown.

Solution
- Replace the unknown variance with an unbiased estimate:

  S²_n = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄_n)²
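The speed-limit Z-test above can be reproduced directly, using the standard Normal right-tail probability (via `math.erfc`) instead of a distribution table. All the numbers come from the example in the text: n = 3 measurements, σ = 2 (variance 4), null value µ = 120.

```python
import math

def normal_sf(z):
    """Right-tail probability P(Z >= z) of the standard Normal N(0, 1)."""
    return 0.5 * math.erfc(z / math.sqrt(2))

mu0, sigma, n = 120.0, 2.0, 3
t_obs = 121.0                                # observed average speed
z = (t_obs - mu0) / (sigma / math.sqrt(n))   # standardize: z = sqrt(3)/2 ~ 0.87
p_value = normal_sf(z)                       # ~ 0.19, matching the table lookup
significant = p_value < 0.05
print(round(z, 2), round(p_value, 2), significant)

# Critical value: smallest c with P(T >= c) = 0.05, i.e. standardized z = 1.645
c = mu0 + 1.645 * sigma / math.sqrt(n)
print(round(c, 1))
```

Since the p-value (about 0.19) exceeds the significance level 0.05, the driver is not fined; any measured average above c ≈ 121.9 would be.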
t-test

The test
- The test statistic is given by the standardized (also called studentized) mean:

  T = (X̄_n − µ_0)/(S_n/√n)

- Assuming the dataset comes from an unknown Normal distribution, the test statistic has a t_{n−1} distribution under the null hypothesis.
- The null hypothesis can be rejected at significance level α if:

  T ≥ t_{n−1,α/2}   or   T ≤ −t_{n−1,α/2}
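A minimal sketch of the two-tailed one-sample t-test: compute the studentized mean and compare it with the tabulated critical value. The data values and µ_0 = 2.0 are illustrative, not from the text; the critical value t_{5,0.025} = 2.571 is taken from the t table at the end of this section.

```python
import math
from statistics import mean, stdev

def t_statistic(xs, mu0):
    """Studentized mean T = (mean - mu0) / (S_n / sqrt(n)).
    statistics.stdev already divides by n - 1 (unbiased estimate)."""
    n = len(xs)
    return (mean(xs) - mu0) / (stdev(xs) / math.sqrt(n))

xs = [2.1, 1.8, 2.4, 2.3, 1.9, 2.2]   # illustrative sample, n = 6, df = 5
T = t_statistic(xs, mu0=2.0)
t_crit = 2.571                        # t_{5, 0.025} from the table
reject = abs(T) >= t_crit
print(round(T, 2), reject)
```

Here |T| ≈ 1.23 < 2.571, so the null hypothesis that the mean is 2.0 is not rejected at the 5% level.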
t_{n−1} distribution
- Bell-shaped distribution, similar to the Normal one.
- Wider and shorter: this reflects the greater variance due to using S²_n instead of the true unknown variance of the distribution.
- n − 1 is the number of degrees of freedom of the distribution (related to the number of independent events observed).
- t_{n−1} tends to the standardized normal z for n → ∞.
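The "wider and shorter" shape and the convergence to the Normal can be checked numerically from the t density, which the standard library lets us write directly via the gamma function. This is an illustrative check, not part of the original slides.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def normal_pdf(x):
    """Density of the standard Normal N(0, 1)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# At the peak x = 0 the t density is lower ("shorter") than the Normal,
# and the gap shrinks as the degrees of freedom grow.
for df in (2, 10, 100):
    print(df, round(t_pdf(0, df), 4))
print("normal", round(normal_pdf(0), 4))
```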
Comparing learning algorithms

Task
- Given two learning algorithms L_A and L_B, we want to estimate whether they are significantly different in learning function f:

  E_{S∼D}[R[L_A(S)] − R[L_B(S)]]

  that is, whether the expected difference between their generalization errors when training on a random sample is significantly different from zero.

Comparing learning algorithms

Problem
- D is unknown: we must rely on a sample D_0.

Solution: k-fold cross validation
- Split D_0 into k equal-sized disjoint subsets T_i.
- For i ∈ [1, k]:
  - train L_A and L_B on S_i = D_0 \ T_i
  - test L_A and L_B on T_i
  - compute δ_i = R_{T_i}[L_A(S_i)] − R_{T_i}[L_B(S_i)]
- Return δ̄ = (1/k) Σ_{i=1}^{k} δ_i

Comparing learning algorithms: t-test
- Null hypothesis: the mean error difference is zero.
- t-test at significance level α: reject the null hypothesis if

  δ̄/S_δ̄ ≤ −t_{k−1,α/2}   or   δ̄/S_δ̄ ≥ t_{k−1,α/2}

  where:

  S_δ̄ = √( (1/(k(k − 1))) Σ_{i=1}^{k} (δ_i − δ̄)² )

Note
- paired test: the two hypotheses were evaluated over identical samples.
- two-tailed test: used if no prior knowledge can tell the direction of the difference (otherwise use a one-tailed test).
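The procedure above can be sketched as follows. The per-fold error lists stand in for the results of actually training and testing L_A and L_B on the k folds, and the critical value t_{4,0.025} = 2.776 comes from the t table; everything else follows the test described in the text.

```python
import math
from statistics import mean

def paired_t_statistic(deltas):
    """Standardized mean of the per-fold error differences delta_i."""
    k = len(deltas)
    d_bar = mean(deltas)
    s = math.sqrt(sum((d - d_bar) ** 2 for d in deltas) / (k * (k - 1)))
    return d_bar / s

def compare(errors_a, errors_b, t_crit):
    """Two-tailed paired test: reject H0 (equal mean error) if |T| >= t_crit."""
    deltas = [a - b for a, b in zip(errors_a, errors_b)]
    T = paired_t_statistic(deltas)
    return T, abs(T) >= t_crit

# Hypothetical per-fold test errors of the two algorithms (k = 5 folds)
errors_a = [0.20, 0.22, 0.19, 0.25, 0.21]
errors_b = [0.18, 0.20, 0.21, 0.19, 0.20]
T, reject = compare(errors_a, errors_b, t_crit=2.776)  # t_{4, 0.025}
print(round(T, 2), reject)
```

With these toy numbers |T| ≈ 1.41 < 2.776, so the difference between the two algorithms would not be judged significant.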
t-test example

10-fold cross validation, test errors:

  T_i    R_{T_i}[L_A(S_i)]   R_{T_i}[L_B(S_i)]   δ_i
  T_1    0.81                0.80                 0.01
  T_2    0.82                0.77                 0.05
  T_3    0.84                0.70                 0.14
  T_4    0.78                0.83                −0.05
  T_5    0.85                0.80                 0.05
  T_6    0.86                0.78                 0.08
  T_7    0.82                0.75                 0.07
  T_8    0.83                0.80                 0.03
  T_9    0.82                0.78                 0.04
  T_10   0.81                0.77                 0.04

Average error difference:

  δ̄ = (1/10) Σ_{i=1}^{10} δ_i = 0.046

Unbiased estimate of the standard deviation:

  S_δ̄ = √( (1/(10·9)) Σ_{i=1}^{10} (δ_i − δ̄)² ) = 0.0154344

Standardized mean error difference:

  δ̄/S_δ̄ = 0.046/0.0154344 = 2.98

t distribution value for α = 0.05 and k = 10:

  t_{k−1,α/2} = t_{9,0.025} = 2.262 < 2.98

The null hypothesis is rejected: the classifiers are significantly different.
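The numbers in the example can be verified in a few lines, taking the per-fold differences δ_i straight from the table:

```python
import math

# Per-fold error differences delta_i from the 10-fold example above
deltas = [0.01, 0.05, 0.14, -0.05, 0.05, 0.08, 0.07, 0.03, 0.04, 0.04]
k = len(deltas)

d_bar = sum(deltas) / k                                            # 0.046
s = math.sqrt(sum((d - d_bar) ** 2 for d in deltas) / (k * (k - 1)))
T = d_bar / s                                                      # ~ 2.98
t_crit = 2.262                                                     # t_{9, 0.025}
print(round(d_bar, 3), round(s, 4), round(T, 2), T >= t_crit)
```

Since 2.98 ≥ 2.262, the null hypothesis of equal mean error is rejected at the 5% level, as stated above.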
t-distribution table (the shaded area of the right tail is equal to α for t = t_α):

  df    t.100   t.050   t.025   t.010   t.005
   1    3.078   6.314  12.706  31.821  63.657
   2    1.886   2.920   4.303   6.965   9.925
   3    1.638   2.353   3.182   4.541   5.841
   4    1.533   2.132   2.776   3.747   4.604
   5    1.476   2.015   2.571   3.365   4.032
   6    1.440   1.943   2.447   3.143   3.707
   7    1.415   1.895   2.365   2.998   3.499
   8    1.397   1.860   2.306   2.896   3.355
   9    1.383   1.833   2.262   2.821   3.250
  10    1.372   1.812   2.228   2.764   3.169
  11    1.363   1.796   2.201   2.718   3.106
  12    1.356   1.782   2.179   2.681   3.055
  13    1.350   1.771   2.160   2.650   3.012
  14    1.345   1.761   2.145   2.624   2.977
  15    1.341   1.753   2.131   2.602   2.947
  16    1.337   1.746   2.120   2.583   2.921
  17    1.333   1.740   2.110   2.567   2.898
  18    1.330   1.734   2.101   2.552   2.878
  19    1.328   1.729   2.093   2.539   2.861
  20    1.325   1.725   2.086   2.528   2.845
  21    1.323   1.721   2.080   2.518   2.831
  22    1.321   1.717   2.074   2.508   2.819
  23    1.319   1.714   2.069   2.500   2.807
  24    1.318   1.711   2.064   2.492   2.797
  25    1.316   1.708   2.060   2.485   2.787
  26    1.315   1.706   2.056   2.479   2.779
  27    1.314   1.703   2.052   2.473   2.771
  28    1.313   1.701   2.048   2.467   2.763
  29    1.311   1.699   2.045   2.462   2.756
  30    1.310   1.697   2.042   2.457   2.750
  32    1.309   1.694   2.037   2.449   2.738
  34    1.307   1.691   2.032   2.441   2.728
  36    1.306   1.688   2.028   2.434   2.719
  38    1.304   1.686   2.024   2.429   2.712
   ∞    1.282   1.645   1.960   2.326   2.576

(Table by Gilles Cazelais, typeset with LaTeX, April 20, 2006.)