Chapter 1

Model Complexity

The intuitive notion of complexity has to do with combining a collection of parts to form a whole. Hopefully the parts are related, as in terms in a model, subsets of a data set, or iterates of a procedure. Naturally, one doesn't want more complexity than necessary (Ockham's rule), but how much complexity is really needed is often unclear. Consider a variance-bias tradeoff: a complex model may have many terms and hence small bias, but estimating the numerous coefficients may make the variance, and hence the MSE, larger than for a simpler model. In these cases, parsimony is preferred over complexity. Or, more exactly, given the model class, the optimal amount of complexity can be regarded as a function of the bias and the sample size.

The variance-bias decomposition can be extended, at least in principle, to a more elaborate decomposition in which there are two bias terms, to assess incorrect estimation within a model separately from the approximation error of the model class to the true model, and two variance terms, one for the variance of model identification within a class of models and one for parameter estimation once a model has been identified. It is easy to imagine even more elaborate decompositions by trying to separate bias due to model selection from bias due to parameter estimation, and biases and variances introduced from varying the model class from, say, splines to neural nets to additive models.

Rather than looking at variance or bias per se, it is often reasonable to look at the capacity and fit of a model in a predictive context. Capacity refers to the ability of the model (or machine) to fit the training sample perfectly. If the model is so flexible that it can perform without error on the training sample, then it is fitting the noise as well as the signal (i.e., overfitting). In this case, capacity and fit are roughly analogous to variance and bias, respectively. A model with too much capacity is like a lookup table giving back the data exactly, and a model with too little capacity is like a constant function that ignores the data entirely. More fancifully, a model with too much capacity is like an officious bureaucrat who files each separate piece of paper in a different file folder, even documents with titles like "Procedure for Setting Security Alarm When Leaving Office" and "Procedure for Turning Off Security Alarm Upon Arriving at Office". A model with too little capacity is like the other kind of bureaucrat, who stores all documents in one vast vault labeled "Work". The best predictive performance typically arises when the family of models strikes the right balance between the capacity of the family and its performance on future data.

Capacity is related to complexity: a more complex model typically has higher capacity, and conversely. However, for a collection of models the relationship between complexity and capacity is more varied. One can imagine a family of models containing a small number of highly complex models, e.g., a tree, a neural net, and a linear regression model, having high capacity individually but giving a family that has low complexity. The opposite can occur too: a family of models may contain a large number of models, e.g., many straight-line regressions, having low capacity individually but giving a family that has high complexity. Indeed, complexity is a sufficiently ill-defined term that it can be interpreted in an algorithmic sense (Kolmogorov complexity), an information-theoretic sense (code length), a decision-theoretic sense (mean squared error as in complexity regularization), as well as in a dimensional sense as in Vapnik-Chervonenkis dimension. The various formalizations of complexity are all intended to get at the notion of how many moving parts there are in a model or model class.

Last chapter the focus was on techniques that are parsimonious, i.e., that add complexity sparingly as sample size increases. This chapter will be focused on assessing complexity in its own right as a characteristic feature of an inference procedure. Since decision theory is central to most of the methods to be presented, a short discussion is worthwhile. It is important to realize from the outset that parsimony is not the goal here, so the regularizations in the risks in Chapter 6, Sec. 1 do not occur very often. It is the various forms of the actual empirical risk that matter most.

In this context, recall the binary classification loss (4.10) and its generalization error (4.11). Write f* to mean the ideal optimizer of the risk R as in (4.12). Let the empirical risk ˆR_n(f) be minimized by ˆf as in (4.14, 4.15). It will be seen that the subsection on Prediction Bounds for Function Classes is a preliminary version of the main points in this chapter. Given this, it is clear that for typical loss functions, the risk functional

R(ˆf) = E{ L(ˆf(x), f(x)) }

does not have a closed form expression. (For the classification case, L(ˆf(x), f(x)) = I{ˆf(x) f(x) ≤ 0}.) This can be dealt with by decomposing the global error as

E[R(f_n)] − R* = ( E[R(f_n)] − inf_{f∈F} R(f) ) + ( inf_{f∈F} R(f) − R* ),   (7.1)

where the left-hand side is the total error, the first term on the right is the estimation error, and the second is the approximation error. Here f_n is the function estimator based on n data points, R* = R(f*) is the optimal risk for the problem at hand, and F is the function space being searched. The goal is to relate the empirical risk ˆR_n(ˆf) to the theoretical risk R*, and the other forms of the risk can be regarded merely as intermediate steps. Using ˆf_n in (7.1) gives

E R(ˆf_n) − R* = [ E R(ˆf_n) − inf_{f∈F} R(f) ] + [ inf_{f∈F} R(f) − R* ],   (7.2)

in which the first term is the estimation error, indicating how close the optimal estimate gets to the optimal risk, and the second term is the approximation error, indicating how well the minimal risk of the model class F can uncover the optimal risk.
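To make the two terms in (7.2) concrete, here is a minimal Monte Carlo sketch that approximates them for a nested sequence of polynomial classes F_k under squared error loss. The true function, noise level, sample size, number of replications, and degrees are illustrative assumptions, not taken from the text.

```python
# A minimal simulation of the decomposition in (7.2): nested polynomial classes F_k
# are fit by least squares, and the estimation and approximation errors are
# approximated by Monte Carlo.  The true function, noise level, and grid of
# degrees are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
f_true = lambda x: np.sin(2 * np.pi * x)          # assumed "true model"
sigma, n, n_test = 0.3, 50, 5000

def risk(coefs, x, y):
    """Empirical squared-error risk of a fitted polynomial."""
    return np.mean((np.polyval(coefs, x) - y) ** 2)

x_test = rng.uniform(0, 1, n_test)
y_test = f_true(x_test) + sigma * rng.normal(size=n_test)
bayes_risk = sigma ** 2                           # R*: no model can beat the noise

for k in (1, 3, 5, 9):                            # model classes F_k = degree-k polynomials
    # approximation error: best degree-k fit to a large noiseless grid
    x_big = np.linspace(0, 1, 20000)
    best_k = np.polyfit(x_big, f_true(x_big), k)
    approx_err = risk(best_k, x_test, y_test) - bayes_risk

    # estimation error: average excess risk of the fitted model over the best in F_k
    excess = []
    for _ in range(200):
        x = rng.uniform(0, 1, n)
        y = f_true(x) + sigma * rng.normal(size=n)
        fhat = np.polyfit(x, y, k)
        excess.append(risk(fhat, x_test, y_test) - risk(best_k, x_test, y_test))
    print(f"k={k}: approx err ~ {approx_err:.4f}, estim err ~ {np.mean(excess):.4f}")
```

As k grows, the approximation error shrinks while the estimation error grows, which is the tradeoff that the structural risk minimization idea discussed next is designed to balance.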

If lim_{k→∞} inf_{f∈F_k} R(f) = R*, then the first term on the right in (7.2) goes to zero for a suitable choice of k = k(n). That is, a rich enough collection of estimators ensures that the optimal risk can be achieved. This procedure is called Structural Risk Minimization because the sequence of F_k's is called a structure. It amounts to the method of sieves, in which one seeks

ˆf_n = argmin_{f∈F_n} ˆR_n(f),

necessitating the specification of F_n and n pre-experimentally.

An alternative way to think of (7.1) is from van de Geer (2004). Again, suppose f* minimizes the theoretical risk and ˆf_n minimizes the empirical risk. Then

R(ˆf_n) − R(f*) ≥ 0   and   R_n(ˆf_n) − R_n(f*) ≤ 0.

This gives the decomposition

0 ≤ R(ˆf_n) − R(f*)
  = −[R_n(ˆf_n) − R(ˆf_n) − R_n(f*) + R(f*)] + R_n(ˆf_n) − R_n(f*)
  ≤ −[R_n(ˆf_n) − R(ˆf_n) − R_n(f*) + R(f*)]
  = −[ν_n(ˆf_n) − ν_n(f*)]/√n,

in which ν_n(f) = √n (R_n(f) − R(f)). Adding R(f*) − R(f_true) to both sides gives

0 ≤ R(ˆf_n) − R(f_true) ≤ −[ν_n(ˆf_n) − ν_n(f*)]/√n + [R(f*) − R(f_true)].   (7.3)

Since the first component on the right is random, it can be bounded, in principle, using empirical process theory. (An empirical process is a sequence of random variables that are functions of the sample, as the sample size grows. Usually, the empirical process is indexed by an unknown function. Empirical process theory is typically used to study the estimation error of an estimator and is intended to give probability or moment bounds for the suprema, over a collection of functions, of an empirical process; see (4.17) for a simple case. An instance of this is the probability bounds on the uniform laws of large numbers for the empirical risk to be seen in the next section. These are often called concentration inequalities because they control the concentration of a random variable around its mean.) The second term on the right is a variant on the second term in (7.1) but represents a different approximation error: the distance from the optimum to the true model rather than the distance from an intermediate optimum to the overall optimum f*.

Typically, one has to take the expectation on the left-hand side of the last bound and look at E R(ˆf_n) − R(f_true) in two cases: the well-specified case, when f_true ∈ F, and the mis-specified case, when f_true ∉ F. Bounds for the latter include a bias term to account for the mis-specification in addition to a variance term to indicate how fast the convergence occurs; see van de Geer (2004, Sec. 2.5).

Expression (7.3) is often called a Basic Inequality because it is the usual starting point for deriving bounds on how much the risk of ˆf_n exceeds the optimal risk.

Is it reasonable to regard risk as an assessment of complexity? The answer appears to be yes, because the risk over a class F is often exquisitely sensitive to the class itself. Indeed, it will be seen in general that an index of complexity called the Vapnik-Chervonenkis dimension of a function class such as F can be meaningfully defined and that the risk over the class F can be bounded by expressions involving the VC dimension. The VC dimension can reasonably be regarded as a dimension since it generalizes the usual notion of real dimension and has a geometric interpretation (shattering) in real spaces that corresponds to an intuitive property of real dimension. Thus, it is reasonable to regard the VC dimension as an index of complexity as well. Since the risk can be bounded by the VC dimension, the risk can plausibly be regarded as another measure of complexity.

Apart from bounding the risk by the VC dimension, the hypersurface the risk defines as a real-valued function of its arguments almost (but not really) provides a topology for function spaces. This means that if a true function is fixed, then the risk can be used as an evaluation of how close a second function is to it. That is, as the second function varies, perhaps becoming more complicated by the addition of extra terms or by allowing more irregular behavior, the risk will detect the changes, i.e., respond numerically to the qualitative changes, usually by increasing.

Clearly, how to measure complexity depends on the task. As noted, dimension is one way to assess a model or a collection of models, and the first section will focus on the VC dimension and its implications. The central role of the VC dimension in controlling risk is often called empirical risk minimization, ERM. Given that risks can be bounded by the VC dimension, it is possible to choose a model class by a procedure called structural risk minimization, SRM. The structure is a sequence of function classes or model spaces, and the goal is to find the function class in the sequence that achieves the best overall risk. Thus, SRM is a complexity-based model selection principle.

A different concept of complexity arises in Probably Approximately Correct (PAC) learning. This setting is based on classification and insists that, regardless of the true distribution, a procedure should be able to identify the set on which a function is one to arbitrary exactitude, with high probability, with a suitably large but fixed sample size. This too is formally expressed in terms of risks. Finite sets are PAC-learnable, and infinite sets are PAC-learnable from a fixed sample size only if their VC dimension is finite. Again, the VC dimension is the measure of complexity, but this time for learning a set from data rather than a model.

Oracle inequalities can be regarded as an assessment of complexity as well. Recall, in Chap. 6 it was seen that an Oracle inequality is a way to bound the error of a procedure, in which some tuning parameters are not known, by a factor times the ideal error that would occur if the tuning parameters were known; this ideal error is usually the smallest possible. Typically, the error of a procedure is a form of risk. Note that not knowing the value of a tuning parameter means it must be estimated, thereby making an inference procedure more complex. Thus, an Oracle inequality is a sense in which the extra complexity from estimating components in a model is not unreasonable compared with the full-knowledge ideal.

A fourth notion of complexity may be the most familiar: description length. The description length of a data set can be used as a measure of complexity. The longer it takes to describe a data set, or anything else, such as a model class, presumably the more information there is in it and hence the more complex it is. This is formalized in information theory, and minimum complexity criteria can be used to estimate parameters, select models or priors, and make predictions. Indeed, the popular Akaike and Bayes information criteria arise from information-theoretic reasoning. Although not obvious, minimal description lengths too are risk based because they emerge from the empirical minimization of a Bayes risk using entropy loss; see Barron and Cover (1990), Rissanen (1996).

1.1 VC Dimension: Definition, Properties, Bounds

The point of VC dimension is to assign a notion of dimensionality to collections of functions that do not necessarily have a linear structure. It often reduces to the usual real notion of independence, but not always. The issue is that, just as dimension in real vector spaces represents the portion of a space a set of vectors can express, VC dimension for sets of functions rests on what geometric properties the functions can express in terms of classification. In DMML, the VC dimension helps set bounds on the performance capability of procedures.

Consider a training sample {(y_i, x_i) : i = 1, ..., n}; the task is to predict a future Y value from a new X value. The usual procedure is to choose a family of models F = {f(x, θ) : θ ∈ Θ}, with f(·, θ) : IR^p → IR for each θ, and use the training sample to find a value ˆθ that gives good performance. As seen in earlier chapters, this structure holds in both regression and classification: in multiple linear regression, F consists of all p-dimensional hyperflats and the θ's are regression coefficients; in two-class linear discriminant analysis, F consists of all p-dimensional hyperflats and the θ's come from a Mahalanobis distance rule.

As in Chapter 4, a loss function L : A × IR → IR, written as L[f(x_new, ˆθ), y_new], can be used to measure the deviation between a predicted value f(x_new, ˆθ) and a new value y_new. In this notation, A is a collection of actions, the possible choices that can be made depending on what information is available. Making a prediction about a future value or giving an estimate of a parameter are typical actions. Standard loss functions include:

absolute loss: L[f(x, θ), y] = |f(x, θ) − y|;

squared error loss: L[f(x, θ), y] = [f(x, θ) − y]^2;

0-1 loss: L[f(x, θ), y] = I[−y f(x, θ) > 0].

The first two are common in regression settings because they express the idea that the further wrong a prediction is, in a metric sense, the higher the cost should be. The third is appropriate for binary classification, labeled so that y ∈ {−1, 1}, so that the indicator function is 1 if and only if the sign of y differs from the sign of f(x, ˆθ). (The usual rule in classification is to predict 1 or −1 according to the sign of the classification rule f(x, ˆθ); the minus sign in the indicator means that when f gives the right classification the indicator function is 0.)
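For concreteness, the three losses can be written directly in code. The linear predictor f(x, θ) = θ·x used here is only a placeholder assumption for whatever family F is in play.

```python
# The three standard losses listed above for a generic parametric predictor
# f(x, theta).  The linear form of f is a placeholder assumption; any family
# F = {f(., theta)} could be substituted.
import numpy as np

def f(x, theta):
    return float(np.dot(x, theta))

def absolute_loss(x, y, theta):
    return abs(f(x, theta) - y)

def squared_error_loss(x, y, theta):
    return (f(x, theta) - y) ** 2

def zero_one_loss(x, y, theta):
    # y is +1 or -1; the loss is 1 exactly when the sign of f disagrees with y,
    # i.e., when -y * f(x, theta) > 0, matching the indicator I[-y f(x, theta) > 0].
    return 1.0 if -y * f(x, theta) > 0 else 0.0
```

An empirical risk is then just the average of any one of these over the training sample, which is the form (7.5) below takes.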

The risk provides an evaluation of the performance of any model in F. Assume that the joint distribution of future values of (Y, X) has density p(y, x). Then the risk, or expected loss in a future prediction, is

R(θ) = ∫_{IR^p} ∫_{IR} L[f(x, θ), y] p(y, x) dy dx.   (7.4)

This cannot be calculated without knowing p(y, x), which is typically unknown, so minimizing (7.4) is typically impossible. To get around this, the empirical risk, based on the sample, can be used. It is

ˆR_n(θ) = (1/n) Σ_{i=1}^n L[f(x_i, θ), y_i].   (7.5)

The empirical risk is a sample mean and so is subject to LLNs and CLTs. The goal here, however, is to find a bound B on the risk so that, with high probability,

R(θ) ≤ ˆR_n(θ) + B.   (7.6)

This will indicate how bad the empirical risk is as an estimate of the true risk. If the bound is not too loose then, in principle, instead of seeking the minimum risk action it would be enough to minimize the bound ˆR_n(θ) + B in place of minimizing the risk or the empirical risk by itself. This procedure is called Empirical Risk Minimization, ERM.

In general, consistency of ERM solutions for the actual minimum risk solutions will be essential. Consistency can be expressed independently of a specific family F as

inf_{α∈Λ(c)} ˆR_n(α) → inf_{α∈Λ(c)} R(α)   (7.7a)

in probability as n → ∞, for any F(z) and any c, where

Λ(c) = { α : ∫ Q(z, α) dF(z) ≥ c }.   (7.7b)

In this definition, F is the joint distribution for z, which represents (Y, X), and Q(z, α) is a collection of functions with generic index α. The index α can represent a parameter θ from a class of models f, but in general is not related to a model. For the binary classification case with the family {f(x, θ)},

Q(z, α) = L[f(x, θ), y] = I[−y f(x, θ) > 0],

in which case Q_α(z) is the indicator function for a set C_α, so that the role of the family F is only implicit in the notation. However, the Q(z, α)'s are being regarded as a collection of random variables in Z = (Y, X). Expression (7.7a) requires convergence to the minimum for the set of functions {Q_α} and for all subsets of those functions obtained by removing the ones with risk less than c.

It will be seen below that, for functions {Q_α} satisfying

a ≤ Q(z, α) ≤ A,   (7.8)

(I) the ERM method is consistent for F, the distribution of Z, and the functions Q(z, α), α ∈ Λ, if and only if (II) uniform one-sided convergence of the means of the Q(z, α)'s to their expectations occurs under F. In fact, the ERM is consistent in a uniform two-sided sense. While the ERM is quite general, the focus will be mostly on the classification setting with 0-1 loss because the results are clearest. This is the setting in which the Q_α's are indicator functions for sets defined by F, with A = 1 and a = 0.

To see what B in (7.6) might look like, recall (4.20). The Theorem there (stated somewhat informally) can be improved to an expression for the ERM bound for a general collection of functions F satisfying (7.8). Vapnik (1998, Sec. 5.3) shows that, with probability 1 − η, the inequality

R(α) ≤ ˆR_n(α) + (A − a) √E(n, η, h)   (7.9)

holds for all α, where

E(n, η, h) = (4/n) [ h (ln(2n/h) + 1) − ln(η/4) ]

and h is a non-negative integer called the Vapnik-Chervonenkis (VC) dimension, which is a property of F alone. Clearly, if one has a sequence of function spaces F_k, then the bound (7.9) can be applied to each of them and it is a matter of which k to choose, and the VC dimension obviously plays the dominant role. Note that (7.9) does not depend on the distribution of Z or (Y, X), so it can actually be used to find near-optimal solutions. Aside from the finiteness of h, the validity of the ERM bound only requires that the sample be randomly drawn from F. It can be seen that, if h is known, then it will usually be easier to minimize the ERM bound than to minimize the risk directly. On the other hand, it is easy to find F's for which h is so large that the bound is greater than A, in which case (7.9) is useless. Moreover, in the absence of (7.8), i.e., for a more general real-valued Q obtained from a loss function other than 0-1 loss, the bounds available under (7.8) do not appear to generalize fully. The basic principles generalize well, and there are versions of (7.9) for a variety of important cases, but there are cases where specific bounds, rates, and properties for the overall method either have not been developed or do not exist.
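As a quick numerical illustration of (7.9), the sketch below evaluates E(n, η, h) and the resulting bound exactly as the formula is stated above, for the 0-1 loss case a = 0, A = 1; the values of h, η, and the empirical risk are arbitrary choices.

```python
# Evaluate the ERM bound (7.9): R(alpha) <= R_hat_n(alpha) + (A - a) * sqrt(E(n, eta, h)),
# with E(n, eta, h) = (4/n) * (h * (ln(2n/h) + 1) - ln(eta/4)), as given in the text.
import math

def vc_confidence(n, h, eta):
    return (4.0 / n) * (h * (math.log(2.0 * n / h) + 1.0) - math.log(eta / 4.0))

def erm_bound(emp_risk, n, h, eta, a=0.0, A=1.0):
    return emp_risk + (A - a) * math.sqrt(vc_confidence(n, h, eta))

# Example: a class with VC dimension h = 10, empirical risk 0.05, confidence 95%.
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(erm_bound(0.05, n, h=10, eta=0.05), 3))
```

For n small relative to h the bound exceeds A = 1 and is vacuous, matching the remark above that (7.9) can be useless when h is too large; for larger n the confidence term shrinks roughly like the square root of h ln n / n.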

The bulk of this section will be devoted to explications of expressions like (7.9) and their consequences for inference in DMML. This begins with the definitions needed to understand the VC dimension h and continues with the hard work of understanding its properties and how they arise in bounds on the risk.

VC Dimension of F: Definitions

There are no less than three ways to approach defining the VC dimension. The most accessible is geometric, based on the idea of shattering a set of points. Shattering means that a collection of functions F, regarded as curves in a surface, or hypersurfaces in a space, can be used to partition a set of n points in all 2^n ways. The VC dimension is the largest number of points that can be shattered by F.

A separate approach is essentially combinatorial. The idea of shattering a point set by the use of functions is converted into a process of picking out points in a point set by the use of sets in a given collection of sets. The VC dimension arises as the exponent in a polynomial bound (whose existence is guaranteed by a result called Sauer's Lemma) on how many ways a set of n elements can be partitioned by using a collection of sets to pick out subsets of them.

The third way to define VC dimension is more technical. Undergirding much of the mathematics is the concept of a covering number. Given a set in a metric space, an ε-net is a collection of elements of the metric space so that each element in the set is at most ε away from some element in the collection. To be useful, an ε-net should be finite. The minimal cardinality of the ε-nets (for a fixed ε) for a certain set of vectors, defined from a set F of, say, indicator functions evaluated on n points, gives the covering number for that choice of ε and n. The covering number is, of course, related to F and to the number of different separations of collections of n points. The logarithm of the supremum of the (minimal) covering number over sets of n points is called the growth function. As a function of n it increases, but it can be proved that the qualitative behavior of the function changes from linear to logarithmic at a certain integer. This integer is, again, the VC dimension. The growth function represents how the information gain per extra point from a class of functions accumulates: past a certain point, given by h, the rate of information gain slows dramatically.

Geometric

Since the VC dimension h of a class of functions F depends on how they separate points, start by considering the two-class discrimination problem with a family F indexed by θ, say f(x, θ) ∈ {−1, 1}. Given a set of n points, there are 2^n subsets that can be regarded as arising from labeling the n points in all 2^n possible ways with 0, 1. Now, fix any one such labeling and suppose there is a θ so that f(x_i, θ) assigns 1 when x_i has the label 1 and −1 when x_i has the label 0. This means that f(·, θ) is a member of F that correctly assigns the labels. If for each of the 2^n labelings there is a member of F that can correctly assign those labels to the n points, then the set of points is shattered by F. The VC dimension of F is the maximum number of points that can be shattered by the elements of F. It is convenient to think of the points as data points, with n the sample size.

Now, the VC dimension of a set of indicator functions I_α(z), generated by, say, F, where α ∈ Λ indexes the domains on which I = 1, is the largest number h of points that can be separated into two different classes in all possible ways using that set of indicator functions. If there are n distinct points z_1, ..., z_n (in any configuration) in a fixed space that can be separated in all 2^n possible ways, then the VC dimension h is at least n. That is, it is enough to shatter one set of n vectors to show the dimension is at least n. If, for every value of n, there is a set of n vectors that can be shattered by the I(z, α)'s, then F has VC dimension infinity. So, to find the VC dimension of a collection of functions on a real space, one can test each n = 1, 2, 3, ... to find the first value of n at which no set of n points can be shattered, i.e., at which every configuration of n points has a labeling that cannot be replicated by the functions.

It is important to note that the definition of shattering is phrased in terms of all possible labelings of n vectors, as represented by the supports of the indicator functions, which lie in some fixed space. So, the VC dimension is for the set F whose elements define the supports of the indicator functions, not for the space itself. In a sense, the VC dimension is not of F itself so much as of the level sets defined by F, since they generate the indicator functions.

Indeed, the definition of VC dimension for a general set of functions F, not necessarily indicator functions, is obtained from the indicator functions of the level sets of F = {f_α(·) : α ∈ Λ}. Let f_α ∈ F be a real-valued function. Then the set of functions

I_{ {z : f_α(z) − β ≥ 0} }(z),   for α ∈ Λ and β ∈ ( inf_{z,α} f_α(z), sup_{z,α} f_α(z) ),   (7.10)

is the complete set of indicators for F. The VC dimension of F is then the maximal number h of vectors z_1, ..., z_h that can be shattered by the complete set of indicators of F, to which the earlier definition applies.

The geometric definition is often the easiest to use for examples, even though the other definitions are often easier for proving theorems. The intuition behind shattering will be clearer shortly, after a few examples are presented to show what it means in typical cases. It will also be seen later that when F consists of indicators, so that (7.8) holds with a = 0 and A = 1, results are much easier to prove and cleaner to state than for the case of a general class of real-valued functions.

Combinatorial

The definition of VC dimension can be built up from picking out subsets of a set, a concept related to shattering. This gives the VC dimension of a class of sets, which can then be extended to function classes. This was the original way that Vapnik and Chervonenkis (1971) expressed the idea. Let C be a collection of subsets of a space X, let {x_1, ..., x_n} be a set of n points in X, and let {C ∩ {x_1, ..., x_n} : C ∈ C} be the collection of subsets of {x_1, ..., x_n} picked out by C. Similar to before, the collection C shatters {x_1, ..., x_n} if and only if all 2^n of its subsets can be picked out by C, and the VC dimension h of the class C of sets is the smallest n for which no set of size n is shattered by C. Formally, van der Vaart and Wellner (1996) define

Δ_n(C, x_1, ..., x_n) = #{ {x_1, ..., x_n} ∩ C : C ∈ C },   (7.11a)

so that the VC dimension is

h = inf { n : max_{x_1,...,x_n} Δ_n(C, x_1, ..., x_n) < 2^n }.   (7.11b)

The infimum over an empty set can be defined to be infinity, so the VC dimension of a class of sets is infinite exactly when C shatters arbitrarily large sets.

Picking out points from a set is much like checking which indicator functions are 1 and inferring the location of a point by examining their supports. Essentially, the combinatorial interpretation of VC dimension is an effort to formalize the idea that finitely many data points only permit us to distinguish finitely many events. Intuitively, the idea is to use the data to form clusters of events. Consider the equivalence relation defined by saying that two events are in the same equivalence class if and only if they pick out the same data points. Then, the number of equivalence classes depends on the data set and the set of functions F. For fixed n, this is heuristically like finding a data-driven sub-σ-field from the σ-field generated by C.
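The picking-out definition can be checked by brute force for small point sets. The sketch below computes Δ_n(C, x_1, ..., x_n) of (7.11a) for the class C of left half-lines (−∞, a] on the real line (a class that reappears in the Examples subsection) and reports the smallest n with max Δ_n < 2^n, which is the h of (7.11b).

```python
# Brute-force computation of Delta_n(C, x_1,...,x_n) in (7.11a) for the class
# C = { (-inf, a] : a real } acting on points of the real line, and of the
# h in (7.11b).  The candidate point sets come from a small illustrative grid.
from itertools import combinations

def picked_out(points):
    """Distinct subsets of `points` of the form {x : x <= a}."""
    pts = sorted(points)
    subsets = {frozenset()}                      # a below all points picks out nothing
    for i in range(len(pts)):
        subsets.add(frozenset(pts[: i + 1]))     # a just at/above pts[i]
    return subsets

def max_delta(n, universe):
    return max(len(picked_out(c)) for c in combinations(universe, n))

universe = [0.0, 1.0, 2.0, 3.0, 4.0]
for n in (1, 2, 3):
    d = max_delta(n, universe)
    print(f"n={n}: max Delta_n = {d}, 2^n = {2**n}, shattered: {d == 2**n}")
# One point is shattered but two are not, so the h of (7.11b) is 2, consistent
# with the half-line example in the Examples subsection later in the chapter.
```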

Let f ∈ F, a class of functions. The subgraph of f : X → IR is

{(x, t) : t < f(x)},   (7.12)

and the collection F of measurable functions on a sample space is called a VC class if the collection of all subgraphs of the f's in F forms a class of sets with finite VC dimension. It is not too hard to see that (7.12) is equivalent to the geometric definition when F is a collection of indicator functions: draw the subgraph of a generic indicator function and observe that a set of points can be shattered in the geometric sense by indicator functions if and only if the corresponding subgraphs of the indicator functions shatter the set of points in the picking-out sense. More generally, results in van der Vaart and Wellner (1996, Sec. 2.6) suggest that (7.12) remains equivalent to (7.10): the indicators of the level sets for a function in (7.10) correspond to subsets of the subgraph of the function, so the subgraph of a function appears to hold all the shattering properties in the level sets when the VC dimension is finite, and conversely. (Problems 10 and 11 in van der Vaart and Wellner (1996) show how to partition the subgraph of a function in some cases.)

By definition, a VC class of sets picks out strictly fewer than 2^n subsets from any set of n ≥ h = VCdim points. Intuition from the behavior of continuous functions suggests that as the size of the point set a class of sets tries to shatter increases, the smaller the proportion of subsets the class can pick out. That is, if n ≥ h + 1, then even though as many as 2^n − 1 subsets could conceivably be picked out, the actual number is polynomial in n, not merely exponential with a smaller exponent. This surprising fact is called Sauer's Lemma; see Sauer (1972). It is of interest in its own right and will be important in the more technical definition of VC dimension in the next subsection. Let the shatter function for a class C be

τ(n) = max_{x_1,...,x_n} Δ_n(C, x_1, ..., x_n),   (7.12a)

the largest number of subsets of the x_i's one can obtain by intersecting {x_1, ..., x_n} with sets in C. If a set of size n is shattered by C, then VCdim(C) ≥ n and τ(n) = 2^n. If there is no such set, then VCdim(C) < n and τ(n) < 2^n.

Theorem (Sauer's Lemma): If VCdim(C) = h, then τ(n) ≤ n^h.

Proof: Following Ben-David (2003), see also van der Vaart and Wellner (1996, Sec. 2.6), if A is any finite set in X, then the condition

#{A ∩ C : C ∈ C} ≤ #{B ⊆ A : C shatters B}   (7.13a)

implies Sauer's Lemma, because VCdim(C) = h implies that

#{B ⊆ A : C shatters B} ≤ Σ_{i=0}^{h} C(#(A), i) ≤ #(A)^h.   (7.13b)

Expression (7.13a) follows by an induction argument on #(A). For #(A) = 0 both sides of (7.13a) are 1. If (7.13a) is true for all sets smaller than A and all collections of sets C, it is enough to prove (7.13a) for a fixed but arbitrary A and a general C.

Let x ∈ A and set A(x) = A \ {x}; A(x) is smaller than A, so the induction hypothesis can be applied to it. Now define three collections of sets. The first is the collection of subsets of A picked out by C:

C(A) = {C ∩ A : C ∈ C}.

The second is the collection of subsets of A(x) picked out by C:

C(A(x)) = {C ∩ A(x) : C ∈ C}.

The third is the collection of all subsets of A(x) that are in C(A) (and so do not contain x) but can be augmented by x to give another set in C(A):

C(A(x), aug) = {B ⊆ A(x) : B ∈ C(A) and B ∪ {x} ∈ C(A)}.

The first step is to show

#(C(A)) = #(C(A(x))) + #(C(A(x), aug)).   (7.14)

Let B ⊆ A(x). Then there are four possibilities for B and B ∪ {x} in relation to C(A). First, if B ∈ C(A) and B ∪ {x} ∈ C(A), then B is in C(A(x)) and in C(A(x), aug), while B and B ∪ {x} are two sets that contribute to C(A). So, both sides of (7.14) are affected equally. Second, if B ∈ C(A) but B ∪ {x} ∉ C(A), then B ∈ C(A(x)) but B ∉ C(A(x), aug). So, it contributes one set to the left-hand side of (7.14) and one set to the first term on the right-hand side of (7.14). Third, if B ∉ C(A) but B ∪ {x} ∈ C(A), then B ∪ {x} contributes one set to the left-hand side of (7.14). However, B ∪ {x} = C_x ∩ A for some C_x ∈ C, so

C_x ∩ A(x) = (C_x ∩ A) \ {x} = B ∈ C(A(x)),

which means it contributes one set to the first term on the right in (7.14). Finally, if B ∉ C(A) and B ∪ {x} ∉ C(A), then no set is contributed to any of the terms in (7.14). This exhausts the possibilities and establishes (7.14).

The second, and last, step in the proof is to apply the induction hypothesis to both sides of (7.14) and obtain (7.13a). Note that both collections of sets on the right in (7.14) contain only subsets of A(x), so C ∩ A(x) = C for each C in them. So, the two terms can be bounded as follows. First,

#(C(A(x))) = #{C ∩ A(x) : C ∈ C(A(x))}
           ≤ #{B ⊆ A(x) : C(A(x)) shatters B}
           = #{B ⊆ A \ {x} : C shatters B}
           = #{B ⊆ A : x ∉ B and C shatters B}.   (7.15a)

Similarly,

#(C(A(x), aug)) = #{C ∩ A(x) : C ∈ C(A(x), aug)}
              ≤ #{B ⊆ A(x) : C(A(x), aug) shatters B}
              = #{B ⊆ A \ {x} : C shatters B ∪ {x}}
              = #{B ⊆ A : x ∈ B and C shatters B}.   (7.15b)

Using (7.15a) and (7.15b) in (7.14) gives

#(C(A)) = #(C(A(x))) + #(C(A(x), aug)) ≤ #{B ⊆ A : C shatters B},

which is (7.13a), completing the induction. An extension of Sauer's Lemma, see van der Vaart and Wellner (1996), states that

τ(n) ≤ Σ_{j=0}^{h−1} C(n, j) ≤ ( ne/(h−1) )^{h−1}.   (7.13c)

Entropy Numbers

The goal is to assign a class of functions F a measure that generalizes the concept of real dimension so that a bound like (7.6) or (7.9) can be given. This will necessitate defining covering numbers, their finite-sample analogs called random entropy numbers, and then generalizations of entropy numbers called the annealed entropy and the growth function. The VC dimension will emerge from a key shape property of the growth function. This subsection is largely a revamping of parts of Vapnik (1998, Chap. 3).

To start, recall that the simplest measure of the size of a set is its cardinality, but since F is typically uncountable that is no help. What is a help is the concept of a covering number, since it too is a measure of size. Since F is a subset of a normed space with norm ||·||, given an ε one can measure F's size by the minimal number of balls of radius ε required to cover F. This is called the covering number N(ε, F, ||·||). The ε-entropy of F, or the Boltzmann entropy, is its natural log, ln N(ε, F, ||·||).

The covering number of a class of indicator functions F is related to the number of subsets that can be picked out from {x_1, ..., x_n} by the functions in F. Indeed, write F = {1_C : C ∈ C}, so that C is the class of sets whose indicator functions are in F. Now, a complicated argument based on Sauer's Lemma gives a bound on the covering number for F. The result is that there is a universal constant K so that for 0 < ε < 1 and r ≥ 1, if h is the (geometric) VC dimension of C, then

N(ε, F, L_r(µ)) = N(ε, C, L_r(µ)) ≤ K h (4e)^h ε^{−r(h−1)},

where L_r(µ) is the Lebesgue space with r-th power norm with respect to µ; see van der Vaart and Wellner (1996, Chap. 2.6).

The more typical definition of the entropy, also called the Shannon entropy, is minus the expected log of the density; this is the expectation of a sort of random entropy. The log density plays a role similar to the Boltzmann entropy because both represent a codelength. The log density is the codelength for identifying which X has occurred; the ε-entropy is the codelength for the number of configurations possible in a thermodynamic system. They are equal for uniform densities and, more generally, are related when the configurations can be grouped together to make a non-uniform distribution. The fourth section in this chapter will treat the Shannon entropy in detail.

To see informally how an ε-entropy might arise in ERM, observe that a variant on (7.7) is the uniform two-sided convergence condition

P( sup_{α∈Λ} | ∫ Q(z, α) dF(z) − (1/n) Σ_{i=1}^n Q(z_i, α) | > ε ) → 0   (7.16)

as n → ∞, and note that when, for each α, the Q(·, α)'s are indicator functions for sets C_α, the second term inside the absolute value is just the empirical probability of C_α under the empirical distribution ˆF_n from z_1, ..., z_n, defined through the functions in F. That is, (7.16) is

P( sup_{α∈Λ} | P({Z : Q(Z, α) > 0}) − ˆF_n({Z : Q(Z, α) > 0}) | > ε ) → 0,   (7.17)

in which the probability P outside the supremum in (7.17) applies to the randomness in Z_1, ..., Z_n. For a single fixed α, Hoeffding's inequality gives

P( | P({Z : Q(Z, α) > 0}) − ˆF_n({Z : Q(Z, α) > 0}) | > ε ) ≤ 2 e^{−2nε^2},   (7.18)

so the challenge is how to obtain uniformity over general sets of α's, i.e., over the whole set of events {z : Q(z, α) > 0} for α ∈ Λ.

As noted in Vapnik (1998), the case of finitely many α's in (7.17) can be handled as well. For α_1, ..., α_K,

P( max_{1≤k≤K} | P({Z : Q(Z, α_k) > 0}) − ˆF_n({Z : Q(Z, α_k) > 0}) | > ε )
  ≤ Σ_{k=1}^K P( | P({Z : Q(Z, α_k) > 0}) − ˆF_n({Z : Q(Z, α_k) > 0}) | > ε )
  ≤ 2K e^{−2nε^2} = 2 e^{n( ln K / n − 2ε^2 )}.   (7.19)

So, if K = K(n) and ln K/n → 0, it is reasonable to expect that (7.16) will hold. Moreover, if K is regarded as a number of ε-balls and (7.18) holds for a small neighborhood around each α at the centers of the ε-balls, then, if the ε is chosen small enough, (7.19) should hold for all α ∈ Λ, provided the function space is not too large in a VC dimension sense. That is, the ε-entropy is analogous to the ln K in (7.19). This is a metric variant of the usual compactness argument that extracts a finite open cover from a collection of balls, each of which has a certain desirable property. Here, however, covering F or Λ to control the convergence can be done in several different senses, and one of them will give a definition of the VC dimension.
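Before turning to the data-driven quantities, it may help to see the finite-class bound (7.19) numerically. The sketch below evaluates 2K e^{−2nε²} and the sample size needed to make it small; the choices of K, ε, and the target probability are illustrative assumptions.

```python
# Evaluate the finite-class uniform deviation bound (7.19):
#   P( max_k |P_k - F_hat_n,k| > eps ) <= 2 K exp(-2 n eps^2) = 2 exp(n (ln K / n - 2 eps^2)).
# The values of K, eps, and delta below are illustrative assumptions.
import math

def finite_class_bound(n, K, eps):
    return 2.0 * K * math.exp(-2.0 * n * eps * eps)

def sample_size_needed(K, eps, delta):
    """Smallest n making the bound at most delta."""
    return math.ceil((math.log(2.0 * K) - math.log(delta)) / (2.0 * eps * eps))

eps, delta = 0.05, 0.05
for K in (10, 1_000, 100_000):
    n = sample_size_needed(K, eps, delta)
    print(f"K={K}: need n >= {n}; bound at that n = {finite_class_bound(n, K, eps):.4f}")
# The required n grows only like ln K, which is why conditions of the form
# ln K(n)/n -> 0 (and, later, entropy/n -> 0) are the right scale.
```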

While the relationships among the ε-entropy of F, the VC dimension of F, and other quantities are deep, all of these are population quantities. By contrast, for ERM, the importance is on their data-driven analogs. So, given a finite sample, it is important to have an analog of the covering number. This is provided by counting the number of sets a sample of size n can distinguish. Define two events as distinguishable using a sample z_1, ..., z_n if and only if there is at least one z_i that belongs to one event and not the other. It is immediate that not all sets in a large class C are distinguishable given a sample; different samples can have different collections of distinguishable sets; and the number of distinguishable sets can depend on the sample as well.

The next task is to identify which sets it is important for a sample to be able to distinguish. For a set of indicator functions Q(z, α) with support sets C_α, α ∈ Λ, consider the vector of length n

q(α) = (Q(z_1, α), ..., Q(z_n, α)).   (7.20)

For fixed α, q(α) is a vertex of the unit cube in IR^n. As α ranges over Λ, q(α) hops from vertex to vertex. Let N(Λ, z_1, ..., z_n) be the number of vertices q(α) lands on. Then N(Λ, z_1, ..., z_n) is the number of sets the sample z_1, ..., z_n can distinguish. In fact, N(Λ, z_1, ..., z_n) is the same as Δ_n(C, z_1, ..., z_n) if C is the collection of sets that can be the supports of the indicator functions Q_α. Let the Z_i be IID P. Then N(Λ, Z_1, ..., Z_n) is a random variable bounded by 2^n for given n. The random entropy of the set F of indicator functions Q(·, α), α ∈ Λ, is ln N(Λ, z_1, ..., z_n), and the entropy of F is

H(Λ, n) = ∫ ln N(Λ, z_1, ..., z_n) dF(z_1, ..., z_n).   (7.21)

The idea will be to use N(Λ, z_1, ..., z_n) or H(Λ, n) in place of K in (7.19). It will be possible, in the next section, to show that if H(Λ, n) increases slowly compared to n, then uniform convergence over F can be obtained.

The entropy of real-valued functions in general is similar to the indicator function case, but uses the concept of a covering number more delicately. First consider a set F_C of bounded real-valued functions defined by requiring |Q(z, α)| ≤ C for all α ∈ Λ. Again, q(α) = q(α, z_1, ..., z_n) can be used to generate a region R in the n-dimensional cube of side length 2C. The random covering number of R for a given ε is N(Λ, ε, z_1, ..., z_n), giving ln N(Λ, ε, z_1, ..., z_n) as the random ε-entropy. (To define the covering number, the norm must be specified as well as the ε and the region to be covered. In this case, the natural choice is the metric ρ(q(α), q(α′)) = max_{i=1,...,n} |Q(z_i, α) − Q(z_i, α′)|.) The ε-entropy (not random) for F_C is

H(Λ, ε, n) = ∫ ln N(Λ, ε, z_1, ..., z_n) dF(z_1, ..., z_n).

Note that if F_C consists of indicator functions, then for ε ∈ (0, 1), N(Λ, ε, z_1, ..., z_n) is independent of ε and N(Λ, ε, z_1, ..., z_n) = N(Λ, z_1, ..., z_n), so the definition of the random covering number for bounded functions reduces to that for indicator functions. This holds for the random entropy too, and gives H(Λ, ε, n) = H(Λ, n).

For a general class F of functions it is unclear how to define the entropy. However, here it will not be needed. It will be enough to impose the moment condition

−∞ < a ≤ ∫ Q(z, α) dF(z) ≤ A < ∞   (7.22)

for all α ∈ Λ, the envelope condition

∫ sup_{α∈Λ} |Q(z, α)| dF(z) < ∞,   (7.23)

and the truncation condition that the class F_C formed by truncating each f ∈ F from above at C and from below at −C has finite entropy.

Equipped with the entropy, it is easy to specify two closely related quantities that will arise in the bounds to be established involving the VC dimension. First, the annealed entropy is

H_ann(Λ, n) = ln E N(Λ, Z_1, ..., Z_n)   (7.24a)

and, second, the growth function is

G(Λ, n) = ln sup_{z_1,...,z_n} N(Λ, z_1, ..., z_n).   (7.24b)

Clearly, H(Λ, n) ≤ H_ann(Λ, n) ≤ G(Λ, n), the first inequality being the result of Jensen's inequality applied to the entropy H(Λ, n) and the second of taking the supremum of the integrand. The growth function does not depend on the probability measure; it is purely data driven. So, an ERM bound using G(Λ, n) may be preferred over entropy or annealed entropy bounds. Clearly, H_ann(Λ, n) ≤ G(Λ, n) implies that any bound using the annealed entropy gives a bound in terms of the growth function. On the other hand, the three conditions

lim_{n→∞} H(Λ, n)/n = 0,   lim_{n→∞} H_ann(Λ, n)/n = 0,   lim_{n→∞} G(Λ, n)/n = 0

are increasingly stringent and will be seen to give greater and greater control on bounds of the form (7.19) that undergird ERM.

The key theorem linking the VC dimension to a collection of indicator functions is the following. It provides an interpretation for h separate from its definition in terms of shattering or (7.11b). Although h arises in the argument by using the geometric definition of shattering, in principle there is no reason not to use the result of this Theorem as a third definition for h. Let C(n, i) be the number of combinations of i items from n items.

Theorem (Vapnik, 1998): The growth function G(Λ, n) for a set of indicator functions Q(z, α), α ∈ Λ, is linear-logarithmic in the sense that

G(Λ, n) = n ln 2,                                                  if n ≤ h,
G(Λ, n) ≤ ln( Σ_{i=0}^{h} C(n, i) ) ≤ h ln(en/h) = h(1 + ln(n/h)),   if n > h.

Comment: If h = ∞, then the second case in the bound never applies, so G(Λ, n) = n ln 2 for all n.

The essence of the result is that if a set of functions has finite VC dimension, then its growth function is initially linear, meaning that the amount of learning accumulated is proportional to the sample size up to the VC dimension, after which the learning per datum drops off to a logarithmic rate.

Outline of Proof: The strategy is to examine the relationship between N(Λ, z_1, ..., z_n) and sums of combinations, for they lead to the expression in the theorem. This is characterized by three Facts whose (nontrivial) proofs are omitted.

Fact 1: Suppose z_1, ..., z_n satisfies N(Λ, z_1, ..., z_n) > Σ_{i=0}^{m−1} C(n, i) for some m. Then there is a subsequence z*_1, ..., z*_m of length m so that N(Λ, z*_1, ..., z*_m) = 2^m.

The proof of Fact 1 is an elaborate induction argument on n and m.

Fact 2: Suppose that for some m,

sup_{z_1,...,z_{m+1}} N(Λ, z_1, ..., z_{m+1}) < 2^{m+1}.

Then, for n > m,

sup_{z_1,...,z_n} N(Λ, z_1, ..., z_n) ≤ Φ(m + 1, n),

where Φ(m, n) = Σ_{i=0}^{m−1} C(n, i).

Fact 3: For n > m,

Φ(m, n) < 1.5 n^{m−1}/(m−1)! < ( en/(m−1) )^{m−1}.

Now, put the three Facts together. Since sup_{z_1,...,z_{m+1}} N(Λ, z_1, ..., z_{m+1}) ≤ 2^{m+1} always, whenever there is strict inequality Fact 2 can be applied. Thus, as soon as there is any compression, in the sense that the vectors q(α) hit fewer than the maximum number of vertices, Facts 2 and 3 can be combined to get

sup_{z_1,...,z_n} N(Λ, z_1, ..., z_n) ≤ Φ(m + 1, n) < 1.5 n^m / m! < ( en/m )^m.

Taking logs gives the logarithmic part of the bound in terms of m. That is, as soon as the number of vertices hit by the q(α)'s falls below the maximal number, as defined by the partitions of the data z_1^n generated from F by way of the indicator functions Q(z, α) = L[f(x, α), y], the logarithmic term appears. Note that once the reduction has occurred for one n, it occurs for every n afterward. The logarithmic bound begins to apply at the smallest m permitting the reduction, and this arises when the functions in F can no longer shatter the data, i.e., when the data points can no longer be separated in all possible ways by the indicator functions defined by the functions in F.

This Theorem can be restated in Sauer's Lemma terms as follows. In this sense, the combinatorial definition of h, see (7.11b), leads to the same interpretation of h as in the last Theorem: h is the changepoint where the rate of increase of a function changes from linear in n to logarithmic in n. Note that Δ_n plays the same role as the growth function; they differ mostly by the use of the logarithm.

Theorem: Let C be a class of subsets. Then either

sup_{z_1,...,z_n} Δ_n(C, z_1, ..., z_n) = 2^n   for all n,

or

sup_{z_1,...,z_n} Δ_n(C, z_1, ..., z_n) = 2^n                                if n ≤ h,
sup_{z_1,...,z_n} Δ_n(C, z_1, ..., z_n) ≤ Σ_{i=0}^{h} C(n, i) ≤ (en/h)^h       if n > h,

where h is the last integer n for which the equality in the first case is valid.

Proof: This is a restatement of the last Theorem using the Facts stated there.
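The change from linear to logarithmic behavior is easy to see numerically. The sketch below compares n ln 2, ln Σ_{i=0}^{h} C(n, i), and h(1 + ln(n/h)) from the last two theorems; the choice h = 5 is arbitrary.

```python
# Numerical comparison of the growth-function bounds: for n <= h the value is
# n*ln(2) (all separations possible), while for n > h the bound is
# ln(sum_{i=0}^{h} C(n, i)) <= h*(1 + ln(n/h)).  The choice h = 5 is arbitrary.
import math

def log_sauer(n, h):
    return math.log(sum(math.comb(n, i) for i in range(h + 1)))

h = 5
print(" n    n*ln2    ln(sum C(n,i))    h(1+ln(n/h))")
for n in (2, 5, 10, 50, 200, 1000):
    linear = n * math.log(2.0)
    if n <= h:
        print(f"{n:4d}  {linear:7.2f}   (= growth function when n <= h)")
    else:
        print(f"{n:4d}  {linear:7.2f}   {log_sauer(n, h):12.2f}   {h*(1+math.log(n/h)):12.2f}")
```

Even at n = 1000 the logarithmic bound is about 30 while n ln 2 is nearly 700, which is the "compression" the outline of proof describes.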

Note that the combinatorial definition of h is implicit in the use of Δ_n, but that the bound can be taken as a definition of h. This means the bound gives a definition of h which is de facto equivalent to the combinatorial definition, which was already seen to be equivalent to the geometric definition by shattering.

For completeness, it is important to give explicit definitions of the annealed entropy and growth function for a general set of real-valued functions; this is the case most likely to appear. The entropy and VC dimension have already been defined for this case, so recall that a set of indicator functions Q_α can be used to define the entropy as in (7.20) and (7.21), and suppose the Q_α's arise from a collection of real-valued functions f_α as in (7.10). Let N_β(Λ, z_1^n) be the number of different separations of the n vectors by the complete set of indicators for the Q(z, α)'s; the subscript β indicates the use of the level sets in (7.10). The random entropy of the set of real-valued functions Q(z, α) is now H_β(Λ, z_1^n) = ln N_β(Λ, z_1^n), and the annealed entropy of the Q(z, α)'s is

H_ann,β(Λ, n) = ln E N_β(Λ, Z_1^n).   (7.25a)

The growth function of the Q(z, α)'s is

G_β(Λ, n) = ln max_{z_1^n} N_β(Λ, z_1^n),   (7.25b)

and the VC dimension of the real-valued functions Q(z, α), α ∈ Λ, is the largest number h of vectors z_1, ..., z_h that can be shattered by the complete set of indicators I(Q(z, α) − β ≥ 0) for α ∈ Λ and β in the interval in (7.10). This is the same as before, except that the dependence on the level sets of the f_α's has been indicated. Indeed, as with (7.21) and (7.24a,b),

H_ann,β(Λ, n) ≤ G_β(Λ, n) ≤ h(ln(n/h) + 1).

Examples

To get a sense of how the definition of shattering leads to a concept of dimension, it is worth seeing that the VC dimension often reduces to simple expressions that are minor modifications of the conventional dimension in real spaces. An amusing first observation is that the collection of indicator functions on IR with support (−∞, a], for a ∈ IR, has VC dimension 2 because it cannot pick out the larger of two points x_1 and x_2. However, the collection of indicator functions on IR with support (a, b], for a, b ∈ IR, has VC dimension 3 because it cannot pick out the largest and smallest of three points x_1, x_2, and x_3. The natural extensions of these sets to IR^d have VC dimensions d + 1 and 2d + 1.
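These two claims can be verified by brute force. The sketch below checks the interval case (a, b], using the chapter's convention from (7.11b) in which h is the smallest number of points that cannot be shattered.

```python
# Brute-force check of the interval example: indicator functions with support
# (a, b] on the real line.  Two points can be shattered, but for three points
# the labeling {smallest, largest} without the middle cannot be picked out,
# so the smallest non-shattered size -- the h of (7.11b) -- is 3.
from itertools import product

def interval_pick(points, a, b):
    return frozenset(x for x in points if a < x <= b)

def shatterable(points):
    pts = sorted(points)
    cuts = [min(pts) - 1.0] + pts + [max(pts) + 1.0]     # enough endpoints to try
    achieved = {interval_pick(pts, a, b) for a, b in product(cuts, cuts)}
    return len(achieved) == 2 ** len(pts)

pts = [0.0, 1.0, 2.0]
print("two points shattered:  ", shatterable(pts[:2]))   # True
print("three points shattered:", shatterable(pts))       # False
```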

Now, consider planes through the origin in IR^n. That is, let F be the set of functions of the form f_θ(x) = θ·x = Σ_{i=1}^n θ_i x_i for θ = (θ_1, ..., θ_n) and x = (x_1, ..., x_n). The task is to determine the highest number of points that can be shattered by F. It will be seen that the answer depends on the range of x.

First, suppose that x varies over all of IR^n. Then the VC dimension of F is n + 1. To see this, recall that the shattering definition requires thinking in terms of partitioning point sets by indicator functions. So, associate to any f_θ the indicator function I_{{x : f_θ(x) > 0}}(x), which is 1 when f_θ > 0 and zero otherwise. This is the same as saying the points on one side of the hyperplane f_θ(x) = 0 are coded 1 and the others 0. (A minus sign gives the reverse coding.) Now ask: how many points in IR^n must accumulate before they can no longer be partitioned in all possible ways? More formally, if there are k points, how large must k be before the number of ways the points can be partitioned by indicator functions I_{{x : f_θ(x) > 0}}(x) falls below 2^k? Sauer's Lemma and the last theorem guarantee that such a k exists; the question is to find it for a given n.

One way to proceed is to start with n = 2 and test values of k. So, consider k = 1 point in IR^2. There are 2 ways to label the point, 0 and 1, and the two cases are symmetric. The class of indicator functions obtained from F is {I_{{x : f_θ(x) > 0}}(x)}. Given any labeling of the point by 0 or 1, some f ∈ F gives that labeling and −f gives the other. So, the VC dimension is at least 1. Next, consider 2 points in IR^2: there are four ways to label the two points with 0 and 1. Suppose the two points do not lie on a common line through the origin unless one of them is the origin. It is easy to find a line through the origin so that both points are on the side of it that gets 1, or the side that gets 0. And as long as the two points are not on a common line through the origin (and are distinct from the origin), there will be a line through the origin so that one of the points is on the side that gets 1 and the other is on the side that gets 0. So, there are infinitely many pairs of points that can be shattered; picking one means the VC dimension is at least 2.

Now, consider 3 points in IR^2. To get a VC dimension of at least three, it is enough to find 3 points that can be shattered. If none of the points is the origin, typically they cannot be shattered. However, if one of the points is the origin and the other two are not collinear with the origin, then the three points can be shattered by the indicator functions. So, the VC dimension is at least three. In fact, in this case, the VC dimension cannot be 4 or higher: there is no configuration of 4 points, even if one is at the origin, that can be shattered by planes through the origin; this can be seen by a modification of the proof of the next theorem. If n = 3, then the same kind of argument produces 4 points that can be shattered (one of them at the origin) and, again, the next proof can be modified to show that 4 is the maximal number of points that can be shattered. Higher values of n are also covered by the theorem and establish that VCdim(F) = n + 1.

The real dimension of the class of indicator functions I_{{x : f_θ(x) > 0}}(x) for f ∈ F is, however, n, not n + 1. The discrepancy is cleared up by looking closely at the role of the origin. It is uniquely easy to separate from any other point because it is always on the boundary of the support of an indicator function. As a consequence, the linear independence of the components x_i of x is indeterminate when the values of the x_i's are zero. Thus, if the origin is removed from IR^n, so that the new domain of the functions in F is IR^n \ {0}, the VC dimension becomes n.

To bring this out clearly in general, redefine the class to be the set of indicator functions

F′ = { I_{{x : Σ_{i=1}^n θ_i φ_i(x) ≥ 0}} : θ_1, ..., θ_n ∈ IR },

where the φ_i's are a linearly independent collection of n functions. If there is an x_0 where all the φ_i(x)'s are zero, without loss of generality assume x_0 = 0 by translation. If all φ_i(x) = x_i, then F′ reduces to F, which consisted of planes through the origin.


More information

Computational Learning Theory

Computational Learning Theory CS 446 Machine Learning Fall 2016 OCT 11, 2016 Computational Learning Theory Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes 1 PAC Learning We want to develop a theory to relate the probability of successful

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Testing Problems with Sub-Learning Sample Complexity

Testing Problems with Sub-Learning Sample Complexity Testing Problems with Sub-Learning Sample Complexity Michael Kearns AT&T Labs Research 180 Park Avenue Florham Park, NJ, 07932 mkearns@researchattcom Dana Ron Laboratory for Computer Science, MIT 545 Technology

More information

Class 2 & 3 Overfitting & Regularization

Class 2 & 3 Overfitting & Regularization Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

PAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht

PAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht PAC Learning prof. dr Arno Siebes Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht Recall: PAC Learning (Version 1) A hypothesis class H is PAC learnable

More information

The PAC Learning Framework -II

The PAC Learning Framework -II The PAC Learning Framework -II Prof. Dan A. Simovici UMB 1 / 1 Outline 1 Finite Hypothesis Space - The Inconsistent Case 2 Deterministic versus stochastic scenario 3 Bayes Error and Noise 2 / 1 Outline

More information

Lebesgue Measure on R n

Lebesgue Measure on R n CHAPTER 2 Lebesgue Measure on R n Our goal is to construct a notion of the volume, or Lebesgue measure, of rather general subsets of R n that reduces to the usual volume of elementary geometrical sets

More information

Linear regression COMS 4771

Linear regression COMS 4771 Linear regression COMS 4771 1. Old Faithful and prediction functions Prediction problem: Old Faithful geyser (Yellowstone) Task: Predict time of next eruption. 1 / 40 Statistical model for time between

More information

Statistical and Computational Learning Theory

Statistical and Computational Learning Theory Statistical and Computational Learning Theory Fundamental Question: Predict Error Rates Given: Find: The space H of hypotheses The number and distribution of the training examples S The complexity of the

More information

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă mmp@stat.washington.edu Reading: Murphy: BIC, AIC 8.4.2 (pp 255), SRM 6.5 (pp 204) Hastie, Tibshirani

More information

Generalization bounds

Generalization bounds Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question

More information

Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;

Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity; CSCI699: Topics in Learning and Game Theory Lecture 2 Lecturer: Ilias Diakonikolas Scribes: Li Han Today we will cover the following 2 topics: 1. Learning infinite hypothesis class via VC-dimension and

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Decision Tree Learning Lecture 2

Decision Tree Learning Lecture 2 Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

More information

Some Background Material

Some Background Material Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric

More information

12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016

12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016 12. Structural Risk Minimization ECE 830 & CS 761, Spring 2016 1 / 23 General setup for statistical learning theory We observe training examples {x i, y i } n i=1 x i = features X y i = labels / responses

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning PAC Learning and VC Dimension Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE

More information

Online Learning, Mistake Bounds, Perceptron Algorithm

Online Learning, Mistake Bounds, Perceptron Algorithm Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which

More information

VC-DENSITY FOR TREES

VC-DENSITY FOR TREES VC-DENSITY FOR TREES ANTON BOBKOV Abstract. We show that for the theory of infinite trees we have vc(n) = n for all n. VC density was introduced in [1] by Aschenbrenner, Dolich, Haskell, MacPherson, and

More information

THE VAPNIK- CHERVONENKIS DIMENSION and LEARNABILITY

THE VAPNIK- CHERVONENKIS DIMENSION and LEARNABILITY THE VAPNIK- CHERVONENKIS DIMENSION and LEARNABILITY Dan A. Simovici UMB, Doctoral Summer School Iasi, Romania What is Machine Learning? The Vapnik-Chervonenkis Dimension Probabilistic Learning Potential

More information

Introduction: The Perceptron

Introduction: The Perceptron Introduction: The Perceptron Haim Sompolinsy, MIT October 4, 203 Perceptron Architecture The simplest type of perceptron has a single layer of weights connecting the inputs and output. Formally, the perceptron

More information

Lecture 8. Instructor: Haipeng Luo

Lecture 8. Instructor: Haipeng Luo Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine

More information

Bennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence

Bennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence Bennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence Chao Zhang The Biodesign Institute Arizona State University Tempe, AZ 8587, USA Abstract In this paper, we present

More information

An Introduction to Statistical Machine Learning - Theoretical Aspects -

An Introduction to Statistical Machine Learning - Theoretical Aspects - An Introduction to Statistical Machine Learning - Theoretical Aspects - Samy Bengio bengio@idiap.ch Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) CP 592, rue du Simplon 4 1920 Martigny,

More information

The Vapnik-Chervonenkis Dimension

The Vapnik-Chervonenkis Dimension The Vapnik-Chervonenkis Dimension Prof. Dan A. Simovici UMB 1 / 91 Outline 1 Growth Functions 2 Basic Definitions for Vapnik-Chervonenkis Dimension 3 The Sauer-Shelah Theorem 4 The Link between VCD and

More information

CONSIDER a measurable space and a probability

CONSIDER a measurable space and a probability 1682 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL 46, NO 11, NOVEMBER 2001 Learning With Prior Information M C Campi and M Vidyasagar, Fellow, IEEE Abstract In this paper, a new notion of learnability is

More information

We are going to discuss what it means for a sequence to converge in three stages: First, we define what it means for a sequence to converge to zero

We are going to discuss what it means for a sequence to converge in three stages: First, we define what it means for a sequence to converge to zero Chapter Limits of Sequences Calculus Student: lim s n = 0 means the s n are getting closer and closer to zero but never gets there. Instructor: ARGHHHHH! Exercise. Think of a better response for the instructor.

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

CS264: Beyond Worst-Case Analysis Lecture #14: Smoothed Analysis of Pareto Curves

CS264: Beyond Worst-Case Analysis Lecture #14: Smoothed Analysis of Pareto Curves CS264: Beyond Worst-Case Analysis Lecture #14: Smoothed Analysis of Pareto Curves Tim Roughgarden November 5, 2014 1 Pareto Curves and a Knapsack Algorithm Our next application of smoothed analysis is

More information

Computational Learning Theory. CS534 - Machine Learning

Computational Learning Theory. CS534 - Machine Learning Computational Learning Theory CS534 Machine Learning Introduction Computational learning theory Provides a theoretical analysis of learning Shows when a learning algorithm can be expected to succeed Shows

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU10701 11. Learning Theory Barnabás Póczos Learning Theory We have explored many ways of learning from data But How good is our classifier, really? How much data do we

More information

An introduction to basic information theory. Hampus Wessman

An introduction to basic information theory. Hampus Wessman An introduction to basic information theory Hampus Wessman Abstract We give a short and simple introduction to basic information theory, by stripping away all the non-essentials. Theoretical bounds on

More information

A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997

A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997 A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997 Vasant Honavar Artificial Intelligence Research Laboratory Department of Computer Science

More information

Consequences of Continuity

Consequences of Continuity Consequences of Continuity James K. Peterson Department of Biological Sciences and Department of Mathematical Sciences Clemson University October 4, 2017 Outline 1 Domains of Continuous Functions 2 The

More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Web-Mining Agents Computational Learning Theory

Web-Mining Agents Computational Learning Theory Web-Mining Agents Computational Learning Theory Prof. Dr. Ralf Möller Dr. Özgür Özcep Universität zu Lübeck Institut für Informationssysteme Tanya Braun (Exercise Lab) Computational Learning Theory (Adapted)

More information

Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh

Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh Generalization Bounds in Machine Learning Presented by: Afshin Rostamizadeh Outline Introduction to generalization bounds. Examples: VC-bounds Covering Number bounds Rademacher bounds Stability bounds

More information

Set, functions and Euclidean space. Seungjin Han

Set, functions and Euclidean space. Seungjin Han Set, functions and Euclidean space Seungjin Han September, 2018 1 Some Basics LOGIC A is necessary for B : If B holds, then A holds. B A A B is the contraposition of B A. A is sufficient for B: If A holds,

More information

Sample width for multi-category classifiers

Sample width for multi-category classifiers R u t c o r Research R e p o r t Sample width for multi-category classifiers Martin Anthony a Joel Ratsaby b RRR 29-2012, November 2012 RUTCOR Rutgers Center for Operations Research Rutgers University

More information

Notes on Complex Analysis

Notes on Complex Analysis Michael Papadimitrakis Notes on Complex Analysis Department of Mathematics University of Crete Contents The complex plane.. The complex plane...................................2 Argument and polar representation.........................

More information

12 Statistical Justifications; the Bias-Variance Decomposition

12 Statistical Justifications; the Bias-Variance Decomposition Statistical Justifications; the Bias-Variance Decomposition 65 12 Statistical Justifications; the Bias-Variance Decomposition STATISTICAL JUSTIFICATIONS FOR REGRESSION [So far, I ve talked about regression

More information

Fast learning rates for plug-in classifiers under the margin condition

Fast learning rates for plug-in classifiers under the margin condition Fast learning rates for plug-in classifiers under the margin condition Jean-Yves Audibert 1 Alexandre B. Tsybakov 2 1 Certis ParisTech - Ecole des Ponts, France 2 LPMA Université Pierre et Marie Curie,

More information

A Bound on the Label Complexity of Agnostic Active Learning

A Bound on the Label Complexity of Agnostic Active Learning A Bound on the Label Complexity of Agnostic Active Learning Steve Hanneke March 2007 CMU-ML-07-103 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Machine Learning Department,

More information

Statistical Learning Learning From Examples

Statistical Learning Learning From Examples Statistical Learning Learning From Examples We want to estimate the working temperature range of an iphone. We could study the physics and chemistry that affect the performance of the phone too hard We

More information

Existence and Uniqueness

Existence and Uniqueness Chapter 3 Existence and Uniqueness An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect

More information

Probability and Measure

Probability and Measure Probability and Measure Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Convergence of Random Variables 1. Convergence Concepts 1.1. Convergence of Real

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Consistency of Nearest Neighbor Methods

Consistency of Nearest Neighbor Methods E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Model Selection and Geometry

Model Selection and Geometry Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model

More information

Probably Approximately Correct (PAC) Learning

Probably Approximately Correct (PAC) Learning ECE91 Spring 24 Statistical Regularization and Learning Theory Lecture: 6 Probably Approximately Correct (PAC) Learning Lecturer: Rob Nowak Scribe: Badri Narayan 1 Introduction 1.1 Overview of the Learning

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

MORE ON CONTINUOUS FUNCTIONS AND SETS

MORE ON CONTINUOUS FUNCTIONS AND SETS Chapter 6 MORE ON CONTINUOUS FUNCTIONS AND SETS This chapter can be considered enrichment material containing also several more advanced topics and may be skipped in its entirety. You can proceed directly

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18 CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

CS6375: Machine Learning Gautam Kunapuli. Decision Trees Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

Topological properties

Topological properties CHAPTER 4 Topological properties 1. Connectedness Definitions and examples Basic properties Connected components Connected versus path connected, again 2. Compactness Definition and first examples Topological

More information

MA103 Introduction to Abstract Mathematics Second part, Analysis and Algebra

MA103 Introduction to Abstract Mathematics Second part, Analysis and Algebra 206/7 MA03 Introduction to Abstract Mathematics Second part, Analysis and Algebra Amol Sasane Revised by Jozef Skokan, Konrad Swanepoel, and Graham Brightwell Copyright c London School of Economics 206

More information

Machine Learning. Computational Learning Theory. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Machine Learning. Computational Learning Theory. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012 Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Computational Learning Theory Le Song Lecture 11, September 20, 2012 Based on Slides from Eric Xing, CMU Reading: Chap. 7 T.M book 1 Complexity of Learning

More information

Bias-Variance Tradeoff

Bias-Variance Tradeoff What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff

More information

Machine Learning. Computational Learning Theory. Eric Xing , Fall Lecture 9, October 5, 2016

Machine Learning. Computational Learning Theory. Eric Xing , Fall Lecture 9, October 5, 2016 Machine Learning 10-701, Fall 2016 Computational Learning Theory Eric Xing Lecture 9, October 5, 2016 Reading: Chap. 7 T.M book Eric Xing @ CMU, 2006-2016 1 Generalizability of Learning In machine learning

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Introduction to Machine Learning (67577) Lecture 3

Introduction to Machine Learning (67577) Lecture 3 Introduction to Machine Learning (67577) Lecture 3 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem General Learning Model and Bias-Complexity tradeoff Shai Shalev-Shwartz

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017 Non-Convex Optimization CS6787 Lecture 7 Fall 2017 First some words about grading I sent out a bunch of grades on the course management system Everyone should have all their grades in Not including paper

More information

Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts arxiv: v2 [math.st] 23 Jul 2012

Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts arxiv: v2 [math.st] 23 Jul 2012 Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts arxiv:203.093v2 [math.st] 23 Jul 202 Servane Gey July 24, 202 Abstract The Vapnik-Chervonenkis (VC) dimension of the set of half-spaces of R d with frontiers

More information