Understanding Generalization Error: Bounds and Decompositions

CIS 520: Machine Learning, Spring 2018: Lecture 11
Understanding Generalization Error: Bounds and Decompositions
Lecturer: Shivani Agarwal

Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the material discussed in the lecture (and vice versa).

Outline
- Introduction
- Generalization error bounds
  - Finite function classes
  - Infinite function classes: VC dimension
- Estimation-approximation error decomposition
- Bias-variance decomposition

1 Introduction

In many learning algorithms, we have some flexibility in choosing a model complexity parameter: the degree of a polynomial kernel; the number of hidden nodes in a neural network; the depth of a decision tree; the number of neighbors in nearest neighbor methods; and so on. We have seen that as the model complexity increases, the training error generally decreases, but the generalization (or test) error generally has a U shape: it is high for models of low complexity, decreases until the model complexity matches the unknown data distribution, and then becomes high again for models of higher complexity.
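To make the U-shaped test-error curve concrete, here is a small illustrative simulation; it is not part of the original notes, and the synthetic data distribution, sample sizes, and polynomial degrees are my own choices. It fits least-squares polynomials of increasing degree (the model complexity parameter here) and prints training and test errors.

```python
# Illustrative sketch only: training vs. test error as model complexity grows.
# The "true" function, noise level, and sample sizes below are assumptions
# made purely for this demonstration.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)                      # underlying regression function
x_train = rng.uniform(-1, 1, 20)
y_train = f(x_train) + rng.normal(0, 0.3, x_train.size)
x_test = rng.uniform(-1, 1, 2000)
y_test = f(x_test) + rng.normal(0, 0.3, x_test.size)

for degree in [1, 3, 5, 9, 15]:                      # increasing model complexity
    coeffs = np.polyfit(x_train, y_train, degree)    # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# Typically: training error keeps decreasing with the degree, while test error
# first decreases and then increases again once the fit starts chasing noise.
```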

Models of low complexity tend to underfit the data: they are not flexible enough to adequately describe patterns in the data. On the other hand, models of high complexity tend to overfit the data: they are so flexible that they fit themselves not only to broad patterns, but also to various types of spurious noise in the particular training data, and so do not generalize well. In general, our goal is to select a model complexity parameter that leads to neither underfitting nor overfitting, i.e. that leads to low generalization error. The challenge, of course, is that the right model complexity depends on the unknown data distribution, and so must also be estimated from the data itself. This is known as the model selection problem.

So far, we have relied on cross-validation as a means to estimate the generalization error for various model complexities, and to thereby solve the model selection problem. However, cross-validation has two disadvantages: (1) it requires making the model selection decision based on training on a smaller number of data points than actually available (since some points need to be held out for validation purposes); (2) it requires training several models for each model complexity parameter under consideration, and is therefore computationally expensive. Wouldn't it be nice if, for each model complexity parameter under consideration, we could just train a model once on the full training data available, and somehow estimate the generalization error from the training error of the resulting model?

In this lecture, we have two goals. First, we will introduce the notion of generalization error bounds. These give bounds on the generalization error of a learned model in terms of its training error. There are several types of generalization error bounds that make use of different properties of the learning algorithm and/or data involved. We will describe the simplest type, which is a uniform convergence bound based on the capacity of the function class searched by an algorithm; in doing so, we will also introduce the Vapnik-Chervonenkis (VC) dimension, which is one widely studied measure of the capacity of a (binary-valued) function class. In practice, most generalization error bounds, particularly those that hold for all data distributions (such as the ones we will discuss here), are quite loose, and would require a very large training sample in order to actually provide useful estimates of the generalization error. However, even when they are loose, these bounds can often be useful for model selection purposes.

Second, we will try to better understand some of the components that contribute to the overall generalization error. In particular, we will try to formalize our intuition about underfitting and overfitting by considering two types of decompositions of the generalization error: a decomposition based on notions of estimation error and approximation error, and a decomposition based on notions of bias and variance. These decompositions are useful in understanding various practices in machine learning and when/why they can be helpful: for example, the estimation-approximation error decomposition is useful in motivating the practice of structural risk minimization, and the bias-variance decomposition is useful in understanding when/why the practice of bootstrap aggregation (bagging) can be helpful. The broad notions we will discuss are applicable in many learning settings, but to keep things concrete, we will focus our discussion mostly on binary classification (under 0-1 loss). The main exception will be when we discuss the bias-variance decomposition, which is most naturally discussed in the context of regression (under squared loss).

2 Generalization Error Bounds

In this section, our goal is to provide bounds on the generalization error of a learned model in terms of its training error. As discussed above, we will focus here on binary classification under 0-1 loss, although the broad ideas apply more generally.

Let D be a probability distribution on $\mathcal{X} \times \{\pm 1\}$, and let $S = ((x_1, y_1), \ldots, (x_m, y_m)) \sim D^m$ be a training sample containing m labeled examples drawn i.i.d. from D. Suppose we learn a binary classifier $h_S : \mathcal{X} \to \{\pm 1\}$

from S, and observe its training error:
$$\hat{\mathrm{er}}^{0\text{-}1}_S[h_S] = \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\big(h_S(x_i) \neq y_i\big)\,.$$
Our goal is to obtain bounds on the generalization error of $h_S$:
$$\mathrm{er}^{0\text{-}1}_D[h_S] = \mathbf{E}_{(X,Y)\sim D}\big[\mathbf{1}\big(h_S(X) \neq Y\big)\big]\,.$$
Since the training error $\hat{\mathrm{er}}^{0\text{-}1}_S[h_S]$ is calculated using the same sample S from which the model $h_S$ is learned, it is typically smaller than the generalization error $\mathrm{er}^{0\text{-}1}_D[h_S]$. A generalization error bound provides a high-confidence bound on the difference $\mathrm{er}^{0\text{-}1}_D[h_S] - \hat{\mathrm{er}}^{0\text{-}1}_S[h_S]$ (one-sided bound) or on the absolute difference $|\mathrm{er}^{0\text{-}1}_D[h_S] - \hat{\mathrm{er}}^{0\text{-}1}_S[h_S]|$ (two-sided bound).

As noted above, there are many types of generalization error bounds that make use of different properties of the learning algorithm used and/or the data distribution involved. We will consider here the simplest type of bound, which will depend only on the function class H searched by the algorithm (e.g. H could be the class of linear classifiers, or the class of quadratic classifiers, etc.).

We start with the following classical concentration inequality, which gives a high-confidence bound on the deviation of the fraction of times a biased coin comes up heads from its expected value:

Theorem 1 (Hoeffding's inequality for i.i.d. Bernoulli random variables). Let $X_1, \ldots, X_m$ be i.i.d. Bernoulli random variables with parameter p, and let $\bar{X} = \frac{1}{m}\sum_{i=1}^{m} X_i$. Then for any $\epsilon > 0$,
$$\mathbf{P}\big(\bar{X} - p \geq \epsilon\big) \leq e^{-2m\epsilon^2} \quad \text{and} \quad \mathbf{P}\big(p - \bar{X} \geq \epsilon\big) \leq e^{-2m\epsilon^2}\,.$$

Applying Hoeffding's inequality to a fixed classifier h that is independent of the sample S, it is easy to see that for any $\epsilon > 0$,
$$\mathbf{P}_{S \sim D^m}\Big(\mathrm{er}^{0\text{-}1}_D[h] - \hat{\mathrm{er}}^{0\text{-}1}_S[h] \geq \epsilon\Big) \leq e^{-2m\epsilon^2}$$
(to see this, set $X_i = \mathbf{1}(h(x_i) \neq y_i)$; then the $X_i$'s are i.i.d. Bernoulli random variables with parameter $p = \mathrm{er}^{0\text{-}1}_D[h]$). Equivalently, for any $0 < \delta \leq 1$, by setting $e^{-2m\epsilon^2} = \delta$ and solving for $\epsilon$, we have that with probability at least $1 - \delta$,
$$\mathrm{er}^{0\text{-}1}_D[h] \leq \hat{\mathrm{er}}^{0\text{-}1}_S[h] + \sqrt{\frac{\ln(1/\delta)}{2m}}\,.$$
However, this reasoning does not apply to the learned classifier $h_S$, since it depends on the training sample S: in particular, the random variables $\mathbf{1}(h_S(x_i) \neq y_i)$ are not independent, since they depend on the full sample S. In order to obtain a bound on the generalization error of $h_S$, we need to do a little more work. To provide some intuition, we start by discussing the case when $h_S$ is learned from a finite function class H; we'll then discuss the more general (and more realistic) case when H can be infinite.

2.1 Finite Function Classes

Consider first the case when the function class H from which $h_S$ is learned is finite. In this case, we can apply Hoeffding's inequality to each classifier h in H separately, and then use the union bound to obtain the following uniform bound that holds simultaneously for all classifiers in H:

Theorem 2 (Uniform error bound for finite H). Let H be finite. Then for any $\epsilon > 0$,
$$\mathbf{P}_{S \sim D^m}\Big(\max_{h \in H}\big(\mathrm{er}^{0\text{-}1}_D[h] - \hat{\mathrm{er}}^{0\text{-}1}_S[h]\big) \geq \epsilon\Big) \leq |H|\, e^{-2m\epsilon^2}\,.$$

Proof. We have
$$\begin{aligned}
\mathbf{P}_{S \sim D^m}\Big(\max_{h \in H}\big(\mathrm{er}^{0\text{-}1}_D[h] - \hat{\mathrm{er}}^{0\text{-}1}_S[h]\big) \geq \epsilon\Big)
&= \mathbf{P}_{S \sim D^m}\Big(\bigcup_{h \in H}\big\{\mathrm{er}^{0\text{-}1}_D[h] - \hat{\mathrm{er}}^{0\text{-}1}_S[h] \geq \epsilon\big\}\Big) \\
&\leq \sum_{h \in H} \mathbf{P}_{S \sim D^m}\Big(\mathrm{er}^{0\text{-}1}_D[h] - \hat{\mathrm{er}}^{0\text{-}1}_S[h] \geq \epsilon\Big), \quad \text{by the union bound} \\
&\leq \sum_{h \in H} e^{-2m\epsilon^2}, \quad \text{by Hoeffding's inequality} \\
&= |H|\, e^{-2m\epsilon^2}\,.
\end{aligned}$$

In other words, for any $0 < \delta \leq 1$, we have that with probability at least $1 - \delta$,
$$\max_{h \in H}\big(\mathrm{er}^{0\text{-}1}_D[h] - \hat{\mathrm{er}}^{0\text{-}1}_S[h]\big) \leq \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}}\,.$$
Since the above bound holds uniformly for all classifiers in H, it follows that it holds in particular for the classifier $h_S$ selected by a learning algorithm. Therefore we have the following generalization error bound for the classifier $h_S$ learned by any algorithm that searches a finite function class H: with probability at least $1 - \delta$,
$$\mathrm{er}^{0\text{-}1}_D[h_S] \leq \hat{\mathrm{er}}^{0\text{-}1}_S[h_S] + \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}}\,.$$
The bound becomes smaller when the number of training examples m increases, or when the confidence parameter $\delta$ is loosened (increased). The quantity $\ln|H|$ here acts as a measure of the capacity of the function class H: as the capacity of H increases (the algorithm has more flexibility to search over a larger function class), the guarantee on the difference between the generalization error and the training error becomes weaker (the bound becomes larger).

2.2 Infinite Function Classes and VC Dimension

In practice, most learning algorithms we have seen learn a classifier from an infinite function class H. In this case, we cannot use $\ln|H|$ to measure the capacity of H; we need a different notion. One such widely used notion is the Vapnik-Chervonenkis (VC) dimension of a class of binary-valued functions H.

Definition (Shattering and VC dimension). Let H be a class of $\{\pm 1\}$-valued functions on $\mathcal{X}$. We say a set of m points $\{x_1, \ldots, x_m\} \subseteq \mathcal{X}$ is shattered by H if all possible $2^m$ binary labelings of the points can be realized by functions in H. The VC dimension of H, denoted by $\mathrm{VCdim}(H)$, is the cardinality of the largest set of points in $\mathcal{X}$ that can be shattered by H. If H shatters arbitrarily large sets of points in $\mathcal{X}$, then $\mathrm{VCdim}(H) = \infty$.

As an example, consider the class of linear classifiers of the form $h(x) = \mathrm{sign}(w \cdot x + b)$ in 2 dimensions, $\mathcal{X} = \mathbb{R}^2$. Figure 1 shows a set of 3 points in $\mathbb{R}^2$ that are shattered by this class. Moreover, it can be verified that no set of 4 points is shattered by this class. Therefore the VC dimension of the class of linear classifiers in $\mathbb{R}^2$ is 3. More generally, the VC dimension of the class of linear classifiers in d dimensions, $\mathcal{X} = \mathbb{R}^d$, is known to be $d + 1$.

Figure 1: Three points in $\mathbb{R}^2$ that can be shattered using linear classifiers.
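As a quick sanity check on the shattering claims above, the following brute-force sketch (not from the notes; the specific point coordinates and the use of a linear-programming feasibility test are my own choices) checks whether every $\pm 1$ labeling of a given point set can be realized by some linear classifier in $\mathbb{R}^2$.

```python
# Brute-force shattering check for linear classifiers sign(w.x + b) in R^2.
# Illustrative sketch: the point sets below are my own choices.
import itertools
import numpy as np
from scipy.optimize import linprog

def realizable(X, y):
    """True if some (w, b) satisfies y_i (w.x_i + b) >= 1 for all i (strict
    separation, which suffices here), checked as an LP feasibility problem."""
    m, d = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((m, 1))])   # -y_i (w.x_i + b) <= -1
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(m),
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(X):
    """True if every +/-1 labeling of the rows of X is realizable by the class."""
    return all(realizable(X, np.array(labels))
               for labels in itertools.product([-1, 1], repeat=X.shape[0]))

three_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(shattered(three_points))   # True: all 8 labelings are realizable
print(shattered(four_points))    # False: e.g. the XOR labeling is not realizable
```

Consistent with the discussion above, the 3 points are shattered while the 4 points are not (indeed, no set of 4 points in $\mathbb{R}^2$ is), in line with a VC dimension of 3 for linear classifiers in the plane.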

For any function class H that has finite VC dimension, we have the following uniform bound that holds simultaneously for all classifiers in H:

Theorem 3 (Uniform error bound for general H). Let $\mathrm{VCdim}(H)$ be finite. Then for any $\epsilon > 0$,
$$\mathbf{P}_{S \sim D^m}\Big(\sup_{h \in H}\big(\mathrm{er}^{0\text{-}1}_D[h] - \hat{\mathrm{er}}^{0\text{-}1}_S[h]\big) \geq \epsilon\Big) \leq 4\,(2em)^{\mathrm{VCdim}(H)}\, e^{-m\epsilon^2/8}\,.$$
(Note that there are tighter versions of this bound; we state a basic version here for simplicity.)

The proof of this result involves advanced machinery which we will not discuss here. For our purposes, this yields the following generalization error bound for the classifier $h_S$ learned by any algorithm that searches a function class H of finite VC dimension: with probability at least $1 - \delta$,
$$\mathrm{er}^{0\text{-}1}_D[h_S] \leq \hat{\mathrm{er}}^{0\text{-}1}_S[h_S] + \sqrt{\frac{8\big(\mathrm{VCdim}(H)\,(\ln(2m) + 1) + \ln(4/\delta)\big)}{m}}\,.$$
As before, the bound becomes smaller when the number of training examples m increases, or when the confidence parameter $\delta$ is loosened (increased). The capacity of the function class H is now measured by $\mathrm{VCdim}(H)$.

The above bound is distribution-free, in that it holds for any distribution D. This is both a strength and a weakness: it is a strength since it does not require any assumptions on D, but it is also a weakness since it means the bound must hold even for the worst-case distribution, and will therefore be loose for most distributions. There are various other types of generalization error bounds that can be tighter than simple capacity-based uniform bounds: some that are distribution-free but that involve data-dependent capacity/complexity measures (e.g. Rademacher complexities); others that are distribution-free, but rather than giving a uniform bound for all functions in a class, bound the error of the learned classifier directly by using other properties of the learning algorithm (e.g. algorithmic stability); and yet others that require assumptions on the distribution.

In general, for small/moderate sample sizes m, most generalization error bounds are too loose to be used as absolute estimates of the generalization error. However, if the bounds are such that they accurately track the relative behavior of the generalization error across different function classes/algorithms, then they can be useful for model selection. For example, to use the VC dimension based bound above for model selection, one would train classifiers on the given training data from different function classes, compare the VC dimension based upper bounds on the generalization errors of the learned classifiers at some suitable confidence level $\delta$, and then select the classifier with the smallest value of this upper bound.
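As an illustration of this model-selection recipe, here is a small sketch (not from the notes). The function class names, VC dimensions, and training errors below are hypothetical placeholders; in practice the training errors would come from actually training a classifier from each class on the given data.

```python
# Hedged sketch: model selection by comparing "training error + VC bound".
# All candidate classes, VC dimensions, and training errors are illustrative
# placeholders, not values from the lecture.
import math

def vc_bound_term(m, vcdim, delta):
    """Square-root term of the VC-dimension-based bound stated above."""
    return math.sqrt((8.0 / m) * (vcdim * (math.log(2 * m) + 1) + math.log(4.0 / delta)))

def select_by_vc_bound(candidates, m, delta=0.05):
    """candidates: (name, training_error, vcdim) triples; pick the smallest upper bound."""
    return min(((name, err + vc_bound_term(m, d, delta)) for name, err, d in candidates),
               key=lambda pair: pair[1])

m = 100_000                                   # number of training examples
candidates = [
    ("linear classifiers",    0.12, 21),      # placeholder training error / VCdim
    ("quadratic classifiers", 0.08, 231),
    ("cubic classifiers",     0.07, 1771),
]
name, bound = select_by_vc_bound(candidates, m)
print(f"selected: {name} (upper bound on generalization error ~ {bound:.3f})")
```

Note how quickly the bound term grows with the VC dimension: even with 100,000 examples the bound for the most complex class here exceeds 1, which illustrates why such bounds are rarely useful as absolute error estimates but can still be used to compare classes of different complexity.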

3 Estimation-Approximation Error Decomposition

Some insight into the generalization error of a classifier $h_S$ learned from a function class H can be obtained by decomposing it as follows:
$$\mathrm{er}^{0\text{-}1}_D[h_S] = \underbrace{\Big(\mathrm{er}^{0\text{-}1}_D[h_S] - \inf_{h \in H} \mathrm{er}^{0\text{-}1}_D[h]\Big)}_{\text{Estimation error in } H} + \underbrace{\Big(\inf_{h \in H} \mathrm{er}^{0\text{-}1}_D[h] - \mathrm{er}^{0\text{-}1,*}_D\Big)}_{\text{Approximation error of } H} + \underbrace{\mathrm{er}^{0\text{-}1,*}_D}_{\text{Irreducible Bayes error}} \qquad (1)$$
Recall that the Bayes error $\mathrm{er}^{0\text{-}1,*}_D$ is the smallest possible generalization error over all possible classifiers; it is an irreducible error associated with the distribution D, sometimes also called the noise intrinsic to D. The approximation error of H measures how far the best classifier in H is from the Bayes optimal classifier; it is a property of the function class H. The estimation error measures how far the learned classifier $h_S$ is from the best classifier in H; this is a property of the learning algorithm, and depends on the training sample S (for a good learning algorithm, one would expect that the estimation error would become smaller with increasing sample size m).

In general, there is a tradeoff between the estimation error and the approximation error. Indeed, for a fixed training sample size m, as the model complexity (here, the complexity of the function class H) increases, we would expect the approximation error to decrease, and the estimation error to increase (see Figure 2). Thus high approximation error is associated with underfitting; on the other hand, high estimation error is associated with overfitting.

Figure 2: For a fixed sample size, as model complexity increases, the approximation error decreases, while the estimation error increases. A high value of either contributes to a high generalization error: high approximation error is associated with underfitting; high estimation error is associated with overfitting.

For a learning algorithm to be statistically consistent, i.e. for its generalization error to converge to the Bayes error as $m \to \infty$, it is clear that we must find a way to make both the estimation error and the approximation error converge to zero. This is typically done via structural risk minimization, where one allows the function class H to grow with the sample size m (so that the approximation error goes to zero), but does so slowly enough that one can still estimate a good function in the class (so that the estimation error also goes to zero). We will come back to this at the end of the course.

4 Bias-Variance Decomposition

Another type of decomposition that is often useful in analyzing generalization error is the bias-variance decomposition. In this case, it is most natural to discuss this decomposition in the context of regression under squared loss. Therefore, in this section, we let D be a probability distribution on $\mathcal{X} \times \mathbb{R}$, and let

$S = ((x_1, y_1), \ldots, (x_m, y_m)) \sim D^m$ be a training sample containing m labeled examples drawn i.i.d. from D. We denote by $f_S : \mathcal{X} \to \mathbb{R}$ the regression model learned by an algorithm from S, and denote the training and generalization errors of $f_S$ as follows:
$$\hat{\mathrm{er}}^{\mathrm{sq}}_S[f_S] = \frac{1}{m}\sum_{i=1}^{m}\big(y_i - f_S(x_i)\big)^2 \qquad\quad \mathrm{er}^{\mathrm{sq}}_D[f_S] = \mathbf{E}_{(X,Y)\sim D}\big[\big(Y - f_S(X)\big)^2\big]\,.$$
The bias-variance decomposition aims to understand the average or expected generalization error of the models $f_S$ that would result if we trained an algorithm on several different training samples S. The analysis below applies both if we consider the full expectation over all samples S drawn from $D^m$, and if we consider an average over some finite number of random samples S; in both cases, we will simply write $\mathbf{E}_S$ to denote this expectation. Our goal, then, is to understand the behavior of $\mathbf{E}_S[\mathrm{er}^{\mathrm{sq}}_D[f_S]]$.

In order to analyze the average generalization error $\mathbf{E}_S[\mathrm{er}^{\mathrm{sq}}_D[f_S]]$, it will be useful to also introduce an average model $\bar{f} : \mathcal{X} \to \mathbb{R}$, whose prediction on an instance x is obtained by simply averaging over the predictions of the individually trained models $f_S$:
$$\bar{f}(x) = \mathbf{E}_S[f_S(x)]\,.$$
Then the average generalization error can be decomposed as follows:
$$\mathbf{E}_S\big[\mathrm{er}^{\mathrm{sq}}_D[f_S]\big] = \underbrace{\mathbf{E}_X\Big[\mathbf{E}_S\big[\big(f_S(X) - \bar{f}(X)\big)^2\big]\Big]}_{\text{Variance}} + \underbrace{\mathbf{E}_X\Big[\big(\bar{f}(X) - f^*(X)\big)^2\Big]}_{\text{Bias}^2} + \underbrace{\mathrm{er}^{\mathrm{sq},*}_D}_{\text{Irreducible error}}\,.$$
Recall that $\mathrm{er}^{\mathrm{sq},*}_D = \mathbf{E}_X[\mathrm{Var}[Y \mid X]]$ is the irreducible error or intrinsic noise associated with D, and that $f^*(x) = \mathbf{E}[Y \mid X = x]$ is the optimal regression model. The (squared) bias term measures how far the average model $\bar{f}$ is from the optimal model $f^*$. The variance term measures how much, on average, a model $f_S$ learned from a particular random sample S bounces around the average model $\bar{f}$.

Again, there is a tradeoff between the bias and variance terms: for a fixed training sample size m, as the model complexity increases, we would expect the bias term to decrease, and the variance term to increase (see Figure 3). Thus high bias is associated with underfitting; on the other hand, high variance is associated with overfitting. Note that here, the notions of bias and variance apply to an algorithm, not necessarily to a function class; so, for example, it is possible for two algorithms that both search the same function class to have different bias and variance properties.

Figure 3: For a fixed sample size, as model complexity increases, the bias typically decreases, while the variance typically increases. A high value of either contributes to a high (average) generalization error: high bias is associated with underfitting; high variance is associated with overfitting.
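To see the decomposition numerically, here is an illustrative sketch (not from the notes): it repeatedly draws training samples from an assumed synthetic distribution, fits a least-squares polynomial to each (the learning algorithm is my own choice), and estimates the variance, squared bias, and irreducible terms, which should approximately add up to the average generalization error.

```python
# Hedged sketch: empirically estimating the bias-variance decomposition.
# The distribution (X ~ Uniform(-1, 1), Y = sin(pi X) + Gaussian noise) and the
# algorithm (degree-5 least-squares polynomial fit) are assumptions made purely
# for illustration.
import numpy as np

rng = np.random.default_rng(0)
f_star = lambda x: np.sin(np.pi * x)          # optimal model f*(x) = E[Y | X = x]
sigma = 0.3                                   # noise level; irreducible error = sigma**2
m, degree, n_samples = 30, 5, 500             # training size, complexity, number of samples S

x_eval = rng.uniform(-1, 1, 5000)             # points approximating the expectation over X
preds = np.empty((n_samples, x_eval.size))    # preds[s] holds f_S evaluated on x_eval
avg_gen_err = 0.0

for s in range(n_samples):
    x = rng.uniform(-1, 1, m)
    y = f_star(x) + rng.normal(0, sigma, m)
    coeffs = np.polyfit(x, y, degree)                         # learn f_S from sample S
    preds[s] = np.polyval(coeffs, x_eval)
    y_eval = f_star(x_eval) + rng.normal(0, sigma, x_eval.size)
    avg_gen_err += np.mean((y_eval - preds[s]) ** 2) / n_samples   # estimates E_S[er_D[f_S]]

f_bar = preds.mean(axis=0)                                    # average model f_bar(x)
variance = np.mean(((preds - f_bar) ** 2).mean(axis=0))       # E_X[ E_S[(f_S(X) - f_bar(X))^2] ]
bias_sq = np.mean((f_bar - f_star(x_eval)) ** 2)              # E_X[ (f_bar(X) - f*(X))^2 ]

print(f"variance {variance:.4f} + bias^2 {bias_sq:.4f} + noise {sigma**2:.4f}"
      f" = {variance + bias_sq + sigma**2:.4f}  vs  average test error {avg_gen_err:.4f}")
```

Re-running the sketch with a higher or lower polynomial degree should shift the balance between the bias and variance terms in the way Figure 3 describes.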

The variance term is related to the stability of an algorithm: an algorithm with high variance has low stability, in the sense that changing the training sample S a little can produce a very different model $f_S$. The practice of bootstrap aggregation (bagging), where one creates multiple randomly selected bootstrap samples from a given training sample S and aggregates (averages) the models learned from the various bootstrap samples, can be viewed as a practice aimed at reducing variance. This is especially useful in reducing the error of algorithms that otherwise have high variance, such as decision tree learning algorithms (indeed, random forests, which apply bagging and random feature selection procedures to decision trees, often have improved performance over algorithms that learn a single decision tree).

Unlike the estimation-approximation error decomposition, whose correctness can be verified by simple visual inspection, the bias-variance decomposition needs a little work to derive. It is easiest to first show the decomposition for a fixed instance x, and then take expectations over x; in particular, it can be shown that for any fixed x,
$$\mathbf{E}_S\Big[\mathbf{E}_{Y \mid X = x}\big[\big(f_S(x) - Y\big)^2\big]\Big] = \underbrace{\mathbf{E}_S\big[\big(f_S(x) - \bar{f}(x)\big)^2\big]}_{\text{Variance at } x} + \underbrace{\big(\bar{f}(x) - f^*(x)\big)^2}_{\text{Bias}^2 \text{ at } x} + \underbrace{\mathrm{Var}[Y \mid X = x]}_{\text{Irreducible error at } x}\,.$$
We leave the details as an exercise for the reader.

Exercise. Show that the bias-variance decomposition is correct. (To do this, first establish that the decomposition shown above for a fixed instance x is correct, and then take expectations over x.)

Acknowledgments

Thanks to Achintya Kundu for help in preparing Figure 1 (as part of scribing a previous lecture by the instructor).