CS 543 Machine Learning Spring 2010
Lecture 05 Evaluating Hypotheses

I. Overview
   A. Given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional examples?
   B. Given that one hypothesis outperforms another over some sample of data, how probable is it that this hypothesis is more accurate in general?
   C. When data is limited, what is the best way to use this data to both learn a hypothesis and estimate its accuracy?
   D. [Statistical methods and assumptions about underlying distributions of data are the basic tools we'll use.]

II. 5.1 Motivation
   A. Need to evaluate the performance of a learned hypothesis when it is important to understand whether or not to use the hypothesis (e.g., a medical treatment effectiveness database and the resulting learned hypothesis)
   B. Evaluating a hypothesis is a component of many learning methods (such as post-pruning decision trees to avoid overfitting)
   C. Estimating the accuracy of a hypothesis is straightforward when data is plentiful
   D. Difficulties in estimating the accuracy of a hypothesis with a limited set of data
      1. Bias in the estimate
         a) Observed accuracy of the learned hypothesis over the training examples is a poor estimator of its accuracy over future examples
            (1) Typically provides an optimistically biased estimate of the hypothesis over future examples
         b) To obtain an unbiased estimate of future accuracy we test the hypothesis on a set of test examples chosen independently of the training examples and the hypothesis
      2. Variance in the estimate
         a) Even if the hypothesis accuracy is measured over an unbiased set of test examples independent of the training examples, the measured accuracy can still vary from the true accuracy
         b) The smaller the set of test examples, the greater the expected variance

III. 5.2 Estimating Hypothesis Accuracy
   A. Introductory remarks
      1. We want to know
         a) The accuracy of a hypothesis in classifying future examples
         b) The probable error in this accuracy estimate
      2. Data and variables for this chapter
         a) X - the set of possible instances over which target functions may be defined
            (1) Assume that different instances of X may be encountered with different frequencies
         b) D - some unknown probability distribution that defines the probability of encountering each instance in X
            (1) D says nothing about whether x is a positive or negative example;
            (2) it only determines the probability that x will be encountered

         c) The learning task
            (1) f - the target concept or target function
            (2) H - the space of possible hypotheses
      3. Training examples of the target function f are provided to the learner by a trainer who draws each instance independently according to the distribution D, and who then forwards the instance x along with the correct target value f(x) to the learner
      4. Example page 130: learn the target function "people who plan to purchase new skis this year"
      5. Given this general setting for learning, we are interested in
         a) Given a hypothesis h and a data sample containing n examples drawn at random according to the distribution D, what is the best estimate of the accuracy of h over future instances drawn from the same distribution?
         b) What is the probable error in this accuracy estimate?
   B. 5.2.1 Sample Error and True Error
      1. Sample error - error rate of the hypothesis over the sample of data that is available
         a) the sample error of a hypothesis with respect to some sample S of instances drawn from X is the fraction of S it misclassifies
         b) error_S(h) = (1/n) * sum over x in S of delta(f(x), h(x))
         c) where
            (1) n is the number of examples in S
            (2) delta(f(x), h(x)) is 1 if f(x) != h(x), and 0 otherwise
      2. True error - error rate of the hypothesis over the entire unknown distribution D of examples
         a) the true error of a hypothesis is the probability that it will misclassify a single randomly drawn instance from the distribution D
         b) error_D(h) = Pr_{x in D} [ f(x) != h(x) ]
         c) the subscript x in D denotes that the probability is taken over the instance distribution D
      3. We usually wish to know the true error of the hypothesis (that is what we can expect when applying the hypothesis to future examples)
      4. All we can measure is the sample error, though
      5. So: how good an estimate of the true error is provided by the sample error?
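(Python sketch, not part of the text: computing the sample error error_S(h) directly from its definition, and approximating the true error error_D(h) with a very large sample. The target f, hypothesis h, and distribution D below are made-up stand-ins.)

```python
import random

def sample_error(h, f, S):
    """error_S(h): fraction of the sample S that h misclassifies."""
    return sum(1 for x in S if h(x) != f(x)) / len(S)

# Made-up target concept f, hypothesis h, and instance distribution D.
f = lambda x: x % 3 == 0              # target function (unknown in practice)
h = lambda x: x % 3 == 0 or x == 7    # learned hypothesis that errs only on x == 7
draw = lambda: random.randint(0, 9)   # the unknown distribution D (uniform here)

random.seed(0)
S = [draw() for _ in range(40)]           # small test sample
big = [draw() for _ in range(200_000)]    # huge sample approximating D
print("sample error error_S(h):", sample_error(h, f, S))
print("true error error_D(h), approx:", sample_error(h, f, big))   # about 0.10
```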

   C. 5.2.2 Confidence Intervals for Discrete-Valued Hypotheses
      1. Suppose we wish to estimate the true error of some discrete-valued hypothesis h over a sample S, where
         a) the sample S contains n examples drawn independent of one another, and independent of h, according to the probability distribution D
            (1) (DISCUSS THIS ON BOARD)
         b) n >= 30
         c) hypothesis h commits r errors over these n examples
            (1) so the sample error is error_S(h) = r/n
      2. Given no other information, the most probable value of error_D(h) is error_S(h)
      3. With approximately 95% probability, the true error lies within the interval error_S(h) +/- 1.96 * sqrt( error_S(h) * (1 - error_S(h)) / n )
      4. Illustrative example (page 131) (discuss confidence interval interpretation on board)
      5. See Table 5.1 page 132 for two-sided N% confidence intervals: error_S(h) +/- z_N * sqrt( error_S(h) * (1 - error_S(h)) / n )
      6. (Excel workbook for examples)
      7. (discuss why the 68% CI is narrower than the 95% CI)
      8. A more accurate rule of thumb is that this method of approximating the true error works well when n * error_S(h) * (1 - error_S(h)) >= 5

IV. Section 5.2 Basic Confidence Interval Example
      n = 40 (number of examples in data sample S)
      r = 12 (number of errors with respect to h)
      sample error error_S(h) = 0.30
      sqrt term sqrt(error_S(h)(1 - error_S(h))/n) = 0.0725
      rule of thumb n * error_S(h) * (1 - error_S(h)) = 8.4000

      Confidence intervals on the true error:
      N %   z_N    lower CI   upper CI
      50    0.67   0.2515     0.3485
      68    1.00   0.2275     0.3725
      80    1.28   0.2073     0.3927
      90    1.64   0.1812     0.4188
      95    1.96   0.1580     0.4420
      98    2.33   0.1312     0.4688
      99    2.58   0.1131     0.4869

V. 5.3 Basics of Sampling Theory
   A. Summary: Table 5.2 page 133
   B. 5.3.1 Error Estimation and Estimating Binomial Proportions
      1. How does the deviation between sample error and true error depend on the size of the data sample?
      2. When we measure the sample error we are performing an experiment with a random outcome
         a) Collect a random sample S of n independently drawn instances from the distribution D, and then measure the sample error error_S(h)
         b) If we repeat this experiment, each time drawing a different random sample S_i of size n, we expect different values of the sample error
         c) error_Si(h), the error measured in the i-th such experiment, is a random variable
      3. Perform k such experiments; as k gets large, the distribution of the error_Si(h) approaches a Binomial distribution
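(Python sketch, not part of the text: reproducing the Section 5.2 confidence interval example above, n = 40 and r = 12, for several two-sided confidence levels.)

```python
from math import sqrt

def error_ci(r, n, z):
    """Two-sided CI: error_S(h) +/- z_N * sqrt(error_S(h)(1 - error_S(h)) / n)."""
    e = r / n
    half_width = z * sqrt(e * (1 - e) / n)
    return e - half_width, e + half_width

n, r = 40, 12
print("rule of thumb n*e*(1-e):", n * (r / n) * (1 - r / n))   # 8.4 >= 5, so OK
for pct, z in [(50, 0.67), (68, 1.00), (80, 1.28), (90, 1.64),
               (95, 1.96), (98, 2.33), (99, 2.58)]:
    lo, hi = error_ci(r, n, z)
    print(f"{pct}% CI on true error: [{lo:.4f}, {hi:.4f}]")
```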

   C. 5.3.2 The Binomial Distribution
      1. Binomial distribution
         a) Binomial experiment properties
            (1) The experiment consists of n repeated trials
            (2) Each trial results in an outcome that may be classified as a success or a failure
            (3) The probability of success, denoted p, remains constant from trial to trial
            (4) The repeated trials are independent
         b) The number X of successes in n trials of a binomial experiment is called a binomial random variable
         c) Binomial distribution
            (1) If a binomial trial can result in a success with probability p and a failure with probability q = 1 - p, then the probability distribution of the binomial random variable X, the number of successes in n independent trials, is
                P(X = x) = (n choose x) * p^x * q^(n - x),  x = 0, 1, 2, ..., n
      2. See Table 5.3 page 134 in text
      3. Estimating p from a random sample of coin tosses is equivalent to estimating the true error from testing h on a random sample of instances
         a) A single coin toss corresponds to drawing a single random instance from D and determining whether it is misclassified by h
         b) The probability p that a single random coin toss will result in heads corresponds to the probability that a single instance drawn at random will be misclassified
            (1) p corresponds to error_D(h)
         c) the number r of heads observed over n coin tosses corresponds to the number of misclassifications observed over n randomly drawn instances
            (1) r/n corresponds to error_S(h)
      4. The general setting to which the Binomial distribution applies is
         a) An underlying experiment whose outcome can be described by a random variable Y, which can take on two possible values
         b) The probability that Y = 1 on any single trial of the experiment is given by some constant p, independent of the outcome of any other experiment (p is usually not known in advance, and the problem is to estimate p)
         c) A series of independent trials of the experiment is performed, producing a sequence of IID random variables Y_1, Y_2, ..., Y_n. Let R = sum over i of Y_i, the number of trials in this sequence of n experiments for which Y_i = 1
         d) The probability that the random variable R will take on a specific value r is
            (1) Pr(R = r) = [ n! / ( r! (n - r)! ) ] * p^r * (1 - p)^(n - r)
   D. 5.3.3 Mean and Variance
      1. The expected value of a random variable is the sum over its values of (probability of the value * the value): E[Y] = sum_i y_i * Pr(Y = y_i)
      2. (see equations 5.3 and 5.4 page 136; for a Binomially distributed Y, E[Y] = np)
      3. Variance measures how far a random variable is expected to differ from its expected value
      4. Var(Y) = E[(Y - E[Y])^2] (see equation 5.5 page 136); the standard deviation is sigma_Y = sqrt(Var(Y)); for the Binomial, Var(Y) = np(1 - p) and sigma_Y = sqrt(np(1 - p)) (equations 5.6 and 5.7, pages 136-137)
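(Python sketch, not part of the text: the Binomial distribution for the running example, treating each drawn instance as a coin toss that h misclassifies with probability p = error_D(h) = 0.30.)

```python
from math import comb, sqrt

def binom_pmf(r, n, p):
    """Pr(R = r) = n!/(r!(n-r)!) * p^r * (1-p)^(n-r)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 40, 0.30
print("Pr(R = 12):", binom_pmf(12, n, p))                  # ~0.137
print("E[R] = np:", n * p)                                 # 12
print("Var(R) = np(1-p):", n * p * (1 - p))                # 8.4
print("sigma_R = sqrt(np(1-p)):", sqrt(n * p * (1 - p)))   # ~2.898
print("pmf sums to 1:", sum(binom_pmf(r, n, p) for r in range(n + 1)))
```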

   E. 5.3.4 Estimators, Bias, and Variance
      1. Recast sample error and true error using the equation for the Binomial distribution, equation 5.2 page 136
      2. error_S(h) = r/n and error_D(h) = p, where
         a) n is the number of instances in the sample S
         b) r is the number of instances from S misclassified by h
         c) p is the probability of misclassifying a single instance drawn from D
      3. error_S(h) is called an estimator for the true error error_D(h)
      4. The estimation bias is the difference between the expected value of the estimator and the true value of the parameter
         a) Estimation bias of an estimator Y for an arbitrary parameter p is E[Y] - p
      5. If the estimation bias is zero, then Y is an unbiased estimator of p
      6. Is error_S(h) an unbiased estimator for error_D(h)?
         a) For the Binomial distribution, E[r] = np
         b) For a constant n, E[r/n] = p
         c) So the answer is yes: error_S(h) = r/n is an unbiased estimator for error_D(h) = p
      7. At the start of the chapter
         a) we said that testing the hypothesis on the training examples provides an optimistically biased estimate of the hypothesis error
         b) For error_S(h) to give an unbiased estimate, the hypothesis h and the sample S must be chosen independently
      8. Given a choice among alternative unbiased estimators, it makes sense to choose the unbiased estimator with least variance
         a) The variance of error_S(h) arises completely from the variance of r in our Binomial experiments
         b) Because r is Binomially distributed, its variance is np(1 - p) (equation 5.7 page 137)
            (1) p is unknown, though, so we substitute our estimate r/n for p
         c) sigma_errorS(h) = sqrt( p(1 - p)/n ) ~ sqrt( error_S(h)(1 - error_S(h))/n ) (see equations 5.8 and 5.9 page 138)

      Binomial probability distribution worksheet (p = 0.30, using the estimate r/n for p)
         Pr(R = r)                             0.1366
         E[R] = np                             12
         sigma_R = sqrt(np(1 - p))             2.8983
         std dev of sample error, sigma_R / n  0.0725
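(Python sketch, not part of the text: a quick simulation check that error_S(h) = r/n is an unbiased estimator of p = error_D(h), plus the worksheet numbers for n = 40, p = 0.30.)

```python
import random
from math import sqrt

random.seed(1)
n, p = 40, 0.30
# Repeat the "draw a sample of n instances, measure error_S(h)" experiment many times.
trials = [sum(random.random() < p for _ in range(n)) / n for _ in range(100_000)]

print("mean of error_S(h) over trials (~p):", sum(trials) / len(trials))
print("sigma of r, sqrt(np(1-p)):", sqrt(n * p * (1 - p)))             # ~2.8983
print("sigma of error_S(h), sqrt(p(1-p)/n):", sqrt(p * (1 - p) / n))   # ~0.0725
# p is unknown in practice, so we substitute the estimate r/n for p:
r_over_n = 0.30
print("estimated sigma:", sqrt(r_over_n * (1 - r_over_n) / n))         # ~0.0725
```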

   F. 5.3.5 Confidence Intervals
      1. Definition page 138
      2. To find a confidence interval, knowing the mean and the standard deviation of a probability distribution, we only have to determine the area under the probability curve that contains N% of the total probability mass
      3. Time to invoke: for large enough sample sizes the Binomial distribution can be closely approximated by the Normal distribution
      4. (See Table 5.4 page 139)
      5. Review page 140
      6. Remember that in calculating the interval
         a) error_S(h) +/- z_N * sqrt( error_S(h) * (1 - error_S(h)) / n )
         b) two approximations were involved
            (1) To estimate the standard deviation of error_S(h) we have approximated error_D(h) by error_S(h) (Eq. 5.8 to Eq. 5.9, page 138) (remember that sigma_errorS(h) = sqrt( error_D(h)(1 - error_D(h)) / n ))
            (2) The Binomial distribution has been approximated by the Normal distribution
      7. Rule of thumb: these two approximations give good results when n * error_S(h) * (1 - error_S(h)) >= 5 and n >= 30
   G. 5.3.6 Two-Sided and One-Sided Bounds
      1. What if we asked: what is the probability that the true error is at most U? (we will need a one-sided bound rather than the two-sided CI)
      2. A 100(1 - alpha)% confidence interval with lower bound L and upper bound U implies
         a) a 100(1 - alpha/2)% confidence interval with lower bound L and no upper bound
         b) a 100(1 - alpha/2)% confidence interval with upper bound U and no lower bound
         c) where alpha is the probability that the value will fall into the unshaded region in Figure 5.1(a) page 140
         d) and alpha/2 is the probability that the value will fall into the unshaded region in Figure 5.1(b) page 140

      Section 5.2 Basic Confidence Interval Example (extended with one-sided bounds)
         n = 40 (number of examples in data sample S)
         r = 12 (number of errors with respect to h)
         sample error error_S(h) = 0.30
         sqrt term sqrt(error_S(h)(1 - error_S(h))/n) = 0.0725
         rule of thumb n * error_S(h) * (1 - error_S(h)) = 8.4000

         Confidence intervals on the true error:
         Two-sided N %   z_N    lower CI   upper CI   alpha   One-sided N %
         50              0.67   0.2515     0.3485     0.50    75.0
         68              1.00   0.2275     0.3725     0.32    84.0
         80              1.28   0.2073     0.3927     0.20    90.0
         90              1.64   0.1812     0.4188     0.10    95.0
         95              1.96   0.1580     0.4420     0.05    97.5
         98              2.33   0.1312     0.4688     0.02    99.0
         99              2.58   0.1131     0.4869     0.01    99.5
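(Python sketch, not part of the text: converting the two-sided intervals above into one-sided bounds; a 100(1 - alpha)% two-sided interval [L, U] yields a 100(1 - alpha/2)% one-sided bound such as "the true error is at most U".)

```python
from math import sqrt

n, r = 40, 12
e = r / n
sigma = sqrt(e * (1 - e) / n)

for two_sided_pct, z in [(90, 1.64), (95, 1.96)]:
    alpha = 1 - two_sided_pct / 100
    upper = e + z * sigma
    one_sided_pct = 100 * (1 - alpha / 2)
    print(f"with {one_sided_pct:.1f}% confidence, true error <= {upper:.4f}")
```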

VI. 5.4 A General Approach for Deriving Confidence Intervals
   A. Introductory remarks
      1. View this as a problem of estimating the mean of a population based on the mean of a randomly drawn sample of size n
      2. General steps
         a) Identify the underlying population parameter p to be estimated (for example, error_D(h))
         b) Define the estimator Y (for example, error_S(h)) [try to choose a minimum-variance, unbiased estimator]
         c) Determine the probability distribution D_Y that governs the estimator Y, including its mean and variance
         d) Determine the N% confidence interval by finding thresholds L and U such that N% of the mass in the distribution D_Y falls between L and U (area under the curve of D_Y)
   B. 5.4.1 Central Limit Theorem
      1. The values of the n independently drawn random variables Y_1, Y_2, ..., Y_n obey the same unknown underlying probability distribution
      2. Let
         a) mu be the mean of the unknown distribution governing each Y_i
         b) sigma be the standard deviation of the unknown distribution
      3. The Y_1, Y_2, ..., Y_n are IID random variables because they describe independent experiments, each obeying the same underlying probability distribution
      4. Compute the sample mean Ybar = (1/n) * sum_i Y_i to estimate the true mean mu
      5. The Central Limit Theorem says that the probability distribution governing the sample mean Ybar approaches a Normal distribution as n goes to infinity, regardless of the distribution that governs the underlying random variables Y_i
      6. It further states that the mean of the distribution governing the sample mean approaches the true mean mu, and its standard deviation approaches sigma / sqrt(n)
      7. (see page 143)

VII. 5.5 Difference in Error of Two Hypotheses
   A. Introductory remarks
      1. h_1 has been tested on sample S_1 containing n_1 randomly drawn examples
      2. h_2 has been tested on sample S_2 containing n_2 randomly drawn examples
      3. Suppose we wish to estimate the difference d between the true errors of these two hypotheses: d = error_D(h_1) - error_D(h_2)
      4. Step 1: identify d as the parameter to be estimated
      5. Step 2: choose an estimator for d
         a) dhat = error_S1(h_1) - error_S2(h_2)
         b) stated here but not proven: dhat gives an unbiased estimate of d
      6. Step 3: what is the probability distribution governing the random variable dhat?
         a) Invoke the Central Limit Theorem: for large n_1 and n_2, error_S1(h_1) and error_S2(h_2) are each approximately Normally distributed
         b) The difference of two Normal distributions is a Normal distribution
         c) dhat will follow a distribution that is approximately Normal with mean d
         d) The variance of this distribution is the sum of the variances of error_S1(h_1) and error_S2(h_2)
         e) sigma_dhat^2 ~ error_S1(h_1)(1 - error_S1(h_1))/n_1 + error_S2(h_2)(1 - error_S2(h_2))/n_2 (equation 5.12 page 144)
      7. Step 4: determine the confidence interval: dhat +/- z_N * sigma_dhat for a two-sided confidence interval (equation 5.13 page 144)
      8. What if h_1 and h_2 had been tested on a single sample S containing n randomly drawn examples?
         a) dhat = error_S(h_1) - error_S(h_2)
         b) The variance in this new dhat will usually be smaller than that in equation 5.12, because using a single sample eliminates the random differences between the two samples used in equation 5.12
         c) The confidence interval from equation 5.13 will then be overly conservative, but still correct
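(Python sketch, not part of the text: equations 5.12 and 5.13 applied to the numbers used in the hypothesis-testing example of Section 5.5.1 below.)

```python
from math import sqrt

def diff_ci(e1, n1, e2, n2, z):
    """dhat +/- z_N * sigma_dhat, with sigma_dhat^2 from equation 5.12."""
    d_hat = e1 - e2
    sigma = sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat - z * sigma, d_hat + z * sigma

# error_S1(h1) = 0.30 on n1 = 100 examples, error_S2(h2) = 0.20 on n2 = 100 examples
lo, hi = diff_ci(0.30, 100, 0.20, 100, z=1.96)
print(f"95% two-sided CI for d: [{lo:.3f}, {hi:.3f}]")
```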

   B. 5.5.1 Hypothesis Testing
      1. Suppose you want to know: what is the probability that error_D(h_1) > error_D(h_2)?
      2. For the example
         a) What is the probability that error_D(h_1) > error_D(h_2), given the observed difference in sample errors dhat = 0.10?
         b) Pr(d > 0) is equal to the probability that dhat has not overestimated d by more than 0.10
         c) That is, the probability that dhat falls into the one-sided interval dhat < d + 0.10
         d) Since d is the mean of the distribution governing dhat, we can express this one-sided interval as dhat < mu_dhat + 0.10
         e) We want the number of standard deviations from the mean that 0.10 allows
            (1) The value 0.10 corresponds to 0.10 / sigma_dhat standard deviations (see the worksheet values below)
            (2) For a one-sided test, a two-sided interval with alpha = 0.10 (a 90% CI) corresponds to a 95% one-sided bound
            (3) The z_N for this is 1.64
            (4) Therefore 0.10 ~ 1.64 * sigma_dhat
         f) Therefore, for this example, the probability that error_D(h_1) > error_D(h_2), given the observed difference in sample errors dhat = 0.10, is approximately 0.95
            (1) We accept the hypothesis that error_D(h_1) > error_D(h_2) with confidence 0.95
            (2) We reject the opposite hypothesis at the (1 - 0.95) = 0.05 level of significance

      Hypothesis Testing worksheet values
         sample error h_1    0.3
         sample error h_2    0.2
         dhat                0.1
         n_1                 100
         n_2                 100
         Var[dhat]           0.0037  (Eq. 5.12)
         stddev[dhat]        0.0608
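(Python sketch, not part of the text: the one-sided test above, Pr(d > 0) given dhat = 0.10, computed with the Normal approximation.)

```python
from math import sqrt, erf

def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

e1, n1 = 0.30, 100
e2, n2 = 0.20, 100
d_hat = e1 - e2
sigma = sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)   # ~0.0608

z = d_hat / sigma                                       # ~1.64 standard deviations
print("z:", round(z, 2))
print("Pr(d > 0), approx:", round(normal_cdf(z), 3))    # ~0.95
```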

VIII. 5.6 Comparing Learning Algorithms
   A. Introductory remarks
      1. We are interested in comparing the performance of two learning algorithms L_A and L_B
      2. One approach is presented here in the text; a reference is cited for an alternative method
      3. What is the parameter we wish to measure?
         a) Determine which of the two learning algorithms L_A and L_B is better on average for learning some particular target concept f
            (1) "On average" means the relative performance of the two algorithms averaged over all training sets of size n that might be drawn from the underlying distribution D
            (2) Estimate the expected value of the difference in their errors
            (3) E_{S subset of D} [ error_D(L_A(S)) - error_D(L_B(S)) ], where L(S) denotes the hypothesis output by learning method L when trained on S (equation 5.14 page 146)
            (4) In practice we have only a limited sample D_0 of data when comparing learning methods
               (a) Divide D_0 into a training set S_0 and a disjoint test set T_0
               (b) Use the training set for both learning algorithms
               (c) Use the test data to compare the accuracy of the two learned hypotheses
               (d) error_T0(L_A(S_0)) - error_T0(L_B(S_0)) (equation 5.15 page 146)
                   (i) Now error_T0 is used to approximate error_D
                   (ii) We only measure the difference in errors for one training set S_0, rather than taking the expected value over all samples S that might be drawn from the distribution
         b) How can this estimator be improved? (see the sketch after this list)
            (1) Repeatedly partition the data D_0 into disjoint training and test sets and take the mean of the test-set error differences for these experiments
            (2) Algorithm: Table 5.5 page 147; it returns deltabar = (1/k) * sum_i delta_i, where delta_i = error_Ti(h_A) - error_Ti(h_B) on the i-th test set T_i
            (3) The returned value deltabar is an estimate of the desired quantity in equation 5.14
            (4) More precisely, it is an estimate of the quantity in equation 5.16 (note the difference in how S is defined)
               (a) There, S is a sample of size ((k - 1)/k) * |D_0| drawn uniformly from D_0, and k is the number of disjoint subsets of equal size used in the algorithm
            (5) N% confidence interval: deltabar +/- t_{N,k-1} * s_deltabar (equation 5.17 page 147)
               (a) now using the t statistic instead of z_N
            (6) Estimate of the standard deviation of the distribution governing deltabar: s_deltabar = sqrt( (1 / (k(k - 1))) * sum_{i=1..k} (delta_i - deltabar)^2 ) (equation 5.18 page 147)
         c) Table 5.6 page 148 gives t-statistic values
         d) The procedure thus far
            (1) is for comparing two learning methods
            (2) involves testing the two learned hypotheses on identical test sets
            (3) Tests where the hypotheses are evaluated over identical samples are called paired tests
               (a) They typically produce tighter confidence intervals than unpaired tests
               (b) When hypotheses are tested on separate data samples, differences in the two sample errors might be partially attributable to differences in the makeup of the two samples
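(Python sketch, not part of the text, modeled on the k-fold procedure of Table 5.5: learn_a and learn_b stand in for L_A and L_B; each takes a training set of (x, label) pairs and returns a hypothesis h(x).)

```python
import random

def error(h, test):
    """Fraction of the test set that hypothesis h misclassifies."""
    return sum(1 for x, y in test if h(x) != y) / len(test)

def compare_learners(learn_a, learn_b, data, k=10, seed=0):
    """Return (delta_bar, s_deltabar) for the paired k-fold comparison."""
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]            # k disjoint test sets T_1..T_k
    deltas = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h_a, h_b = learn_a(train), learn_b(train)     # train both on the same S_i
        deltas.append(error(h_a, test) - error(h_b, test))             # delta_i
    delta_bar = sum(deltas) / k
    s = (sum((d - delta_bar) ** 2 for d in deltas) / (k * (k - 1))) ** 0.5  # eq 5.18
    return delta_bar, s     # N% CI: delta_bar +/- t_{N,k-1} * s  (eq 5.17)
```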

   B. 5.6.1 Paired t Tests (details for the analysis in the previous section)
      1. Summary of the estimation problem
         a) Given observed values of a set of IID random variables Y_1, Y_2, ..., Y_k
         b) We wish to estimate the mean mu of the probability distribution governing these Y_i
         c) The estimator we will use is the sample mean Ybar = (1/k) * sum_i Y_i
      2. Consider the following idealization of the algorithm in Table 5.5
         a) Assume that we can request new training examples drawn according to the underlying instance probability distribution D
         b) Modify the algorithm so that on each iteration through the loop it generates a new training set S_i and a new random test set T_i by drawing from this underlying instance distribution
         c) The delta_i measured by the new procedure now correspond to the IID random variables Y_i
         d) The mean mu of their distribution corresponds to the expected difference in error between the two learning methods (equation 5.14)
         e) The sample mean Ybar is the quantity deltabar computed by this idealized version of the algorithm
         f) Now ask: how good an estimate of mu is provided by deltabar?
      3. Analysis
         a) We have a special case where the Y_i are governed by an approximately Normal distribution (because the test sets contain 30 or more examples, the individual delta_i are approximately Normally distributed)
         b) We don't know the standard deviation of the distribution governing deltabar, though
         c) We need the t test in these cases (estimating the mean of a collection of IID, approximately Normally distributed random variables)
            (1) We can estimate the standard deviation of the sample mean using our deltabar and delta_i values
            (2) s_deltabar = sqrt( (1 / (k(k - 1))) * sum_{i=1..k} (delta_i - deltabar)^2 ) (see the unnumbered equation on page 149)
   C. 5.6.2 Practical Considerations
      1. In practice we are given a limited set of data D_0 and use the algorithm in Table 5.5
      2. The statistical foundations require a sample containing k independent, IID Normal random variables and unlimited access to examples of the target function
      3. In practice, the only way to generate new estimates delta_i (see the algorithm step) for the difference between the errors of the two learning algorithms is to resample D_0, dividing it into training and test sets in different ways
         a) Now the delta_i are not independent of one another
      4. The algorithm in Table 5.5 implements a k-fold method: each example from D_0 is used exactly once in a test set and (k - 1) times in a training set
      5. Alternatively, we might randomly choose a test set of at least 30 examples from D_0 and use the remaining examples for training (repeating as many times as desired)
         a) Advantage: can be repeated an indefinite number of times
         b) Disadvantage: the test sets no longer qualify as being independently drawn with respect to the underlying instance distribution D
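(Python sketch, not part of the text: the paired-t confidence interval for deltabar, given hypothetical per-fold differences delta_i from the procedure above; the t value would come from a table such as Table 5.6 for k - 1 degrees of freedom.)

```python
deltas = [0.02, -0.01, 0.03, 0.00, 0.04, 0.01, 0.02, -0.02, 0.03, 0.01]  # hypothetical delta_i
k = len(deltas)
delta_bar = sum(deltas) / k
s = (sum((d - delta_bar) ** 2 for d in deltas) / (k * (k - 1))) ** 0.5   # eq 5.18

t_95 = 2.26   # approximate two-sided 95% t value for k - 1 = 9 degrees of freedom
lo, hi = delta_bar - t_95 * s, delta_bar + t_95 * s                      # eq 5.17
print(f"deltabar = {delta_bar:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```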

IX. 5.7 Summary and Further Reading

X. Suggested HW: 5.2, 5.3, 5.4