Chapter 1

Model Complexity

The intuitive notion of complexity has to do with combining a collection of parts to form a whole. Hopefully the parts are related, as in terms in a model, subsets of a data set, or iterates of a procedure. Naturally, one doesn't want more complexity than necessary (Ockham's rule), but how much complexity is really needed is often unclear. Consider a variance-bias tradeoff: a complex model may have many terms and hence small bias, but estimating the numerous coefficients may make the variance, and hence the MSE, larger than for a simpler model. In these cases, parsimony is preferred over complexity. Or, more exactly, given the model class, the optimal amount of complexity can be regarded as a function of the bias and the sample size.

The variance-bias decomposition can be extended, at least in principle, to a more elaborate decomposition in which there are two bias terms, to assess incorrect estimation within a model separately from the approximation error of the model class to the true model, and two variance terms, one for the variance of model identification within a class of models and one for parameter estimation once a model has been identified. It is easy to imagine even more elaborate decompositions by trying to separate bias due to model selection from bias due to parameter estimation, and biases and variances introduced from varying the model class from, say, splines to neural nets to additive models.

Rather than looking at variance or bias per se, it is often reasonable to look at the capacity and fit of a model in a predictive context. Capacity refers to the ability of the model (or machine) to fit the training sample perfectly. If the model is so flexible that it can perform without error on the training sample, then it is fitting the noise as well as the signal (i.e., overfitting). In this case, capacity and fit are roughly analogous to variance and bias, respectively. A model with too much capacity is like a lookup table giving back the data exactly, and a model with too little capacity is like a constant function that ignores the data entirely. More fancifully, a model with too much capacity is like an officious bureaucrat who files each separate piece of paper in a different file folder, even documents with titles like "Procedure for Setting Security Alarm When Leaving Office" and "Procedure for Turning Off Security Alarm Upon Arriving at Office". A model with too little capacity is like the other kind of bureaucrat, who stores all documents in one vast vault labeled "Work". The best predictive performance typically arises when the family of models strikes the right balance between the capacity of the family and its performance on future data.

Capacity is related to complexity: a more complex model typically has higher capacity, and conversely. However, for a collection of models the relationship between complexity and capacity is more varied. One can imagine a family of models containing a small number of highly complex models, e.g., a tree, a neural net, and a linear regression model, having high capacity individually but giving a family that has low complexity. The opposite can occur too: a family of models may contain a large number of models, e.g., many straight-line regressions, having low capacity individually but giving a family that has high complexity. Indeed, complexity is a sufficiently ill-defined term that it can be interpreted in an algorithmic sense (Kolmogorov complexity), an information-theoretic sense (code length), a decision-theoretic sense (mean squared error as in complexity regularization), as well as in a dimensional sense as in Vapnik-Chervonenkis dimension. The various formalizations of complexity are all intended to get at the notion of how many moving parts there are in a model or model class.

Last chapter the focus was on techniques that are parsimonious, i.e., that add complexity sparingly as sample size increases. This chapter will be focused on assessing complexity in its own right as a characteristic feature of an inference procedure. Since decision theory is central to most of the methods to be presented, a short discussion is worthwhile. It is important to realize from the outset that parsimony is not the goal here, so the regularizations in the risks in Chapter 6, Sec. 1 do not occur very often. It is the various forms of the actual empirical risk that matter most.

In this context, recall the binary classification loss (4.10) and its generalization error (4.11). Write f* to mean the ideal optimizer of the risk R as in (4.12). Let the empirical risk ˆR_n(f) be minimized by ˆf as in (4.14, 4.15). It will be seen that the subsection on Prediction Bounds for Function Classes is a preliminary version of the main points in this chapter. Given this, it is clear that for typical loss functions, the risk functional

R(ˆf) = E{ L(ˆf(x), f(x)) }

does not have a closed form expression. (For the classification case, L(ˆf(x), f(x)) = I{ˆf(x) f(x) ≤ 0}.) This can be dealt with by decomposing the global error as

E[R(f_n)] − R* = ( E[R(f_n)] − inf_{f∈F} R(f) ) + ( inf_{f∈F} R(f) − R* ),   (7.1)

where the left-hand side is the total error, the first term on the right is the estimation error, and the second is the approximation error. Here f_n is the function estimator based on n data points, R* = R(f*) is the optimal risk for the problem at hand, and F is the function space being searched. The goal is to relate the empirical risk ˆR_n(ˆf) to the theoretical risk R*, and the other forms of the risk can be regarded merely as intermediate steps. Using ˆf_n in (7.1) gives

E R(ˆf_n) − R* = [ E R(ˆf_n) − inf_{f∈F} R(f) ] + [ inf_{f∈F} R(f) − R* ],   (7.2)

in which the first term is the estimation error, indicating how close the optimal estimate gets to the optimal risk, and the second term is the approximation error, indicating how well the minimal risk of the model class F can uncover the optimal risk.
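To make the two terms in (7.2) concrete, here is a minimal Monte Carlo sketch that approximates them for a nested sequence of polynomial classes F_k under squared error loss. The true function, noise level, sample size, number of replications, and degrees are illustrative assumptions, not taken from the text.

```python
# A minimal simulation of the decomposition in (7.2): nested polynomial classes F_k
# are fit by least squares, and the estimation and approximation errors are
# approximated by Monte Carlo.  The true function, noise level, and grid of
# degrees are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
f_true = lambda x: np.sin(2 * np.pi * x)          # assumed "true model"
sigma, n, n_test = 0.3, 50, 5000

def risk(coefs, x, y):
    """Empirical squared-error risk of a fitted polynomial."""
    return np.mean((np.polyval(coefs, x) - y) ** 2)

x_test = rng.uniform(0, 1, n_test)
y_test = f_true(x_test) + sigma * rng.normal(size=n_test)
bayes_risk = sigma ** 2                           # R*: no model can beat the noise

for k in (1, 3, 5, 9):                            # model classes F_k = degree-k polynomials
    # approximation error: best degree-k fit to a large noiseless grid
    x_big = np.linspace(0, 1, 20000)
    best_k = np.polyfit(x_big, f_true(x_big), k)
    approx_err = risk(best_k, x_test, y_test) - bayes_risk

    # estimation error: average excess risk of the fitted model over the best in F_k
    excess = []
    for _ in range(200):
        x = rng.uniform(0, 1, n)
        y = f_true(x) + sigma * rng.normal(size=n)
        fhat = np.polyfit(x, y, k)
        excess.append(risk(fhat, x_test, y_test) - risk(best_k, x_test, y_test))
    print(f"k={k}: approx err ~ {approx_err:.4f}, estim err ~ {np.mean(excess):.4f}")
```

As k grows, the approximation error shrinks while the estimation error grows, which is the tradeoff that the structural risk minimization idea discussed next is designed to balance.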

If lim_{k→∞} inf_{f∈F_k} R(f) = R*, then the first term on the right in (7.2) goes to zero for a suitable choice of k = k(n). That is, a rich enough collection of estimators ensures that the optimal risk can be achieved. This procedure is called Structural Risk Minimization because the sequence of F_k's is called a structure. It amounts to the method of sieves, in which one seeks

ˆf_n = argmin_{f∈F_n} ˆR_n(f),

necessitating the specification of F_n and n pre-experimentally.

An alternative way to think of (7.1) is from van de Geer (2004). Again, suppose f* minimizes the theoretical risk and ˆf_n minimizes the empirical risk. Then

R(ˆf_n) − R(f*) ≥ 0   and   R_n(ˆf_n) − R_n(f*) ≤ 0.

This gives the decomposition

0 ≤ R(ˆf_n) − R(f*)
  = −[R_n(ˆf_n) − R(ˆf_n) − R_n(f*) + R(f*)] + R_n(ˆf_n) − R_n(f*)
  ≤ −[R_n(ˆf_n) − R(ˆf_n) − R_n(f*) + R(f*)]
  = −[ν_n(ˆf_n) − ν_n(f*)]/√n,

in which ν_n(f) = √n (R_n(f) − R(f)). Adding R(f*) − R(f_true) to both sides gives

0 ≤ R(ˆf_n) − R(f_true) ≤ −[ν_n(ˆf_n) − ν_n(f*)]/√n + [R(f*) − R(f_true)].   (7.3)

Since the first component on the right is random, it can be bounded, in principle, using empirical process theory. (An empirical process is a sequence of random variables that are functions of the sample, as the sample size grows. Usually, the empirical process is indexed by an unknown function. Empirical process theory is typically used to study the estimation error of an estimator and is intended to give probability or moment bounds for the suprema, over a collection of functions, of an empirical process; see (4.17) for a simple case. An instance of this is the probability bounds on the uniform laws of large numbers for the empirical risk to be seen in the next section. These are often called concentration inequalities because they control the concentration of a random variable around its mean.) The second term on the right is a variant on the second term in (7.1) but represents a different approximation error: the distance from the optimum to the true model rather than the distance from an intermediate optimum to the overall optimum f*.

Typically, one has to take the expectation on the left-hand side of the last bound and look at E R(ˆf_n) − R(f_true) in two cases: the well-specified case, when f_true ∈ F, and the mis-specified case, when f_true ∉ F. Bounds for the latter include a bias term to account for the mis-specification in addition to a variance term to indicate how fast the convergence occurs; see van de Geer (2004, Sec. 2.5).

Expression (7.3) is often called a Basic Inequality because it is the usual starting point for deriving bounds on how much the risk of ˆf_n exceeds the optimal risk.

Is it reasonable to regard risk as an assessment of complexity? The answer appears to be yes, because the risk over a class F is often exquisitely sensitive to the class itself. Indeed, it will be seen in general that an index of complexity called the Vapnik-Chervonenkis dimension of a function class such as F can be meaningfully defined and that the risk over the class F can be bounded by expressions involving the VC dimension. The VC dimension can reasonably be regarded as a dimension since it generalizes the usual notion of real dimension and has a geometric interpretation (shattering) in real spaces that corresponds to an intuitive property of real dimension. Thus, it is reasonable to regard the VC dimension as an index of complexity as well. Since the risk can be bounded by the VC dimension, the risk can plausibly be regarded as another measure of complexity.

Apart from bounding the risk by the VC dimension, the hypersurface the risk defines as a real-valued function of its arguments almost (but not really) provides a topology for function spaces. This means that if a true function is fixed, then the risk can be used as an evaluation of how close a second function is to it. That is, as the second function varies, perhaps becoming more complicated by the addition of extra terms or by allowing more irregular behavior, the risk will detect the changes, i.e., respond numerically to the qualitative changes, usually by increasing.

Clearly, how to measure complexity depends on the task. As noted, dimension is one way to assess a model or a collection of models, and the first section will focus on the VC dimension and its implications. The central role of the VC dimension in controlling risk is often called empirical risk minimization, ERM. Given that risks can be bounded by the VC dimension, it is possible to choose a model class by a procedure called structural risk minimization, SRM. The structure is a sequence of function classes or model spaces, and the goal is to find the function class in the sequence that achieves the best overall risk. Thus, SRM is a complexity-based model selection principle.

A different concept of complexity arises in Probably Approximately Correct (PAC) learning. This setting is based on classification and insists that, regardless of the true distribution, a procedure should be able to identify the set on which a function is one to arbitrary exactitude, with high probability, with a suitably large but fixed sample size. This too is formally expressed in terms of risks. Finite sets are PAC-learnable, and infinite sets are PAC-learnable from a fixed sample size only if their VC dimension is finite. Again, the VC dimension is the measure of complexity, but this time for learning a set from data rather than a model.

Oracle inequalities can be regarded as an assessment of complexity as well. Recall, in Chap. 6 it was seen that an Oracle inequality is a way to bound the error of a procedure, in which some tuning parameters are not known, by a factor times the ideal error that would occur if the tuning parameters were known; this ideal error is usually the smallest possible. Typically, the error of a procedure is a form of risk. Note that not knowing the value of a tuning parameter means it must be estimated, thereby making an inference procedure more complex. Thus, an Oracle inequality is a sense in which the extra complexity from estimating components in a model is not unreasonable compared with the full-knowledge ideal.

A fourth notion of complexity may be the most familiar: description length. The description length of a data set can be used as a measure of complexity. The longer it takes to describe a data set, or anything else, such as a model class, presumably the more information there is in it and hence the more complex it is. This is formalized in information theory, and minimum complexity criteria can be used to estimate parameters, select models or priors, and make predictions. Indeed, the popular Akaike and Bayes information criteria arise from information-theoretic reasoning. Although not obvious, minimal description lengths too are risk based because they emerge from the empirical minimization of a Bayes risk using entropy loss; see Barron and Cover (1990), Rissanen (1996).

1.1 VC Dimension: Definition, Properties, Bounds

The point of VC dimension is to assign a notion of dimensionality to collections of functions that do not necessarily have a linear structure. It often reduces to the usual real notion of independence, but not always. The issue is that, just as dimension in real vector spaces represents the portion of a space a set of vectors can express, VC dimension for sets of functions rests on what geometric properties the functions can express in terms of classification. In DMML, the VC dimension helps set bounds on the performance capability of procedures.

Consider a training sample {(y_i, x_i) : i = 1, ..., n}; the task is to predict a future Y value from a new X value. The usual procedure is to choose a family of models F = {f(x, θ) : θ ∈ Θ}, with f(·, θ) : IR^p → IR for each θ, and use the training sample to find a value ˆθ that gives good performance. As seen in earlier chapters, this structure holds in both regression and classification: in multiple linear regression, F consists of all p-dimensional hyperflats and the θ's are regression coefficients; in two-class linear discriminant analysis, F consists of all p-dimensional hyperflats and the θ's come from a Mahalanobis distance rule.

As in Chapter 4, a loss function L : A × IR → IR, written as L[f(x_new, ˆθ), y_new], can be used to measure the deviation between a predicted value f(x_new, ˆθ) and a new value y_new. In this notation, A is a collection of actions, the possible choices that can be made depending on what information is available. Making a prediction about a future value or giving an estimate of a parameter are typical actions. Standard loss functions include:

absolute loss: L[f(x, θ), y] = |f(x, θ) − y|;

squared error loss: L[f(x, θ), y] = [f(x, θ) − y]^2;

0-1 loss: L[f(x, θ), y] = I[−y f(x, θ) > 0].

The first two are common in regression settings because they express the idea that the further wrong a prediction is, in a metric sense, the higher the cost should be. The third is appropriate for binary classification, labeled so that y ∈ {−1, 1}, so that the indicator function is 1 if and only if the sign of y differs from the sign of f(x, ˆθ). (The usual rule in classification is to predict 1 or −1 according to the sign of the classification rule f(x, ˆθ); the minus sign in the indicator means that when f gives the right classification the indicator function is 0.)
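For concreteness, the three losses can be written directly in code. The linear predictor f(x, θ) = θ·x used here is only a placeholder assumption for whatever family F is in play.

```python
# The three standard losses listed above for a generic parametric predictor
# f(x, theta).  The linear form of f is a placeholder assumption; any family
# F = {f(., theta)} could be substituted.
import numpy as np

def f(x, theta):
    return float(np.dot(x, theta))

def absolute_loss(x, y, theta):
    return abs(f(x, theta) - y)

def squared_error_loss(x, y, theta):
    return (f(x, theta) - y) ** 2

def zero_one_loss(x, y, theta):
    # y is +1 or -1; the loss is 1 exactly when the sign of f disagrees with y,
    # i.e., when -y * f(x, theta) > 0, matching the indicator I[-y f(x, theta) > 0].
    return 1.0 if -y * f(x, theta) > 0 else 0.0
```

An empirical risk is then just the average of any one of these over the training sample, which is the form (7.5) below takes.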

The risk provides an evaluation of the performance of any model in F. Assume that the joint distribution of future values of (Y, X) has density p(y, x). Then the risk, or expected loss in a future prediction, is

R(θ) = ∫_{IR^p} ∫_{IR} L[f(x, θ), y] p(y, x) dy dx.   (7.4)

This cannot be calculated without knowing p(y, x), which is typically unknown, so minimizing (7.4) is typically impossible. To get around this, the empirical risk, based on the sample, can be used. It is

ˆR_n(θ) = (1/n) Σ_{i=1}^n L[f(x_i, θ), y_i].   (7.5)

The empirical risk is a sample mean and so is subject to LLNs and CLTs. The goal here, however, is to find a bound B on the risk so that, with high probability,

R(θ) ≤ ˆR_n(θ) + B.   (7.6)

This will indicate how bad the empirical risk is as an estimate of the true risk. If the bound is not too loose then, in principle, instead of seeking the minimum risk action it would be enough to minimize the bound ˆR_n(θ) + B in place of minimizing the risk or the empirical risk by itself. This procedure is called Empirical Risk Minimization, ERM.

In general, consistency of ERM solutions for the actual minimum risk solutions will be essential. Consistency can be expressed independently of a specific family F as

inf_{α∈Λ(c)} ˆR_n(α) → inf_{α∈Λ(c)} R(α)   (7.7a)

in probability as n → ∞, for any F(z) and any c, where

Λ(c) = { α : ∫ Q(z, α) dF(z) ≥ c }.   (7.7b)

In this definition, F is the joint distribution for z, which represents (Y, X), and Q(z, α) is a collection of functions with generic index α. The index α can represent a parameter θ from a class of models f, but in general is not related to a model. For the binary classification case with the family {f(x, θ)},

Q(z, α) = L[f(x, θ), y] = I[−y f(x, θ) > 0],

in which case Q_α(z) is the indicator function for a set C_α, so that the role of the family F is only implicit in the notation. However, the Q(z, α)'s are being regarded as a collection of random variables in Z = (Y, X). Expression (7.7a) requires convergence to the minimum for the set of functions {Q_α} and for all subsets of those functions obtained by removing the ones with risk less than c.

It will be seen below that, for functions {Q_α} satisfying

a ≤ Q(z, α) ≤ A,   (7.8)

(I) the ERM method is consistent for F, the distribution of Z, and the functions Q(z, α), α ∈ Λ, if and only if (II) uniform one-sided convergence of the means of the Q(z, α)'s to their expectations occurs under F. In fact, the ERM is consistent in a uniform two-sided sense. While the ERM is quite general, the focus will be mostly on the classification setting with 0-1 loss because the results are clearest. This is the setting in which the Q_α's are indicator functions for sets defined by F, with A = 1 and a = 0.

To see what B in (7.6) might look like, recall (4.20). The Theorem there (stated somewhat informally) can be improved to an expression for the ERM bound for a general collection of functions F satisfying (7.8). Vapnik (1998, Sec. 5.3) shows that, with probability 1 − η, the inequality

R(α) ≤ ˆR_n(α) + (A − a) √E(n, η, h)   (7.9)

holds for all α, where

E(n, η, h) = (4/n) [ h (ln(2n/h) + 1) − ln(η/4) ]

and h is a non-negative integer called the Vapnik-Chervonenkis (VC) dimension, which is a property of F alone. Clearly, if one has a sequence of function spaces F_k, then the bound (7.9) can be applied to each of them and it is a matter of which k to choose, and the VC dimension obviously plays the dominant role. Note that (7.9) does not depend on the distribution of Z or (Y, X), so it can actually be used to find near-optimal solutions. Aside from the finiteness of h, the validity of the ERM bound only requires that the sample be randomly drawn from F. It can be seen that, if h is known, then it will usually be easier to minimize the ERM bound than to minimize the risk directly. On the other hand, it is easy to find F's for which h is so large that the bound is greater than A, in which case (7.9) is useless. Moreover, in the absence of (7.8), i.e., for a more general real-valued Q obtained from a loss function other than 0-1 loss, the bounds available under (7.8) do not appear to generalize fully. The basic principles generalize well, and there are versions of (7.9) for a variety of important cases, but there are cases where specific bounds, rates, and properties for the overall method either have not been developed or do not exist.
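As a quick numerical illustration of (7.9), the sketch below evaluates E(n, η, h) and the resulting bound exactly as the formula is stated above, for the 0-1 loss case a = 0, A = 1; the values of h, η, and the empirical risk are arbitrary choices.

```python
# Evaluate the ERM bound (7.9): R(alpha) <= R_hat_n(alpha) + (A - a) * sqrt(E(n, eta, h)),
# with E(n, eta, h) = (4/n) * (h * (ln(2n/h) + 1) - ln(eta/4)), as given in the text.
import math

def vc_confidence(n, h, eta):
    return (4.0 / n) * (h * (math.log(2.0 * n / h) + 1.0) - math.log(eta / 4.0))

def erm_bound(emp_risk, n, h, eta, a=0.0, A=1.0):
    return emp_risk + (A - a) * math.sqrt(vc_confidence(n, h, eta))

# Example: a class with VC dimension h = 10, empirical risk 0.05, confidence 95%.
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(erm_bound(0.05, n, h=10, eta=0.05), 3))
```

For n small relative to h the bound exceeds A = 1 and is vacuous, matching the remark above that (7.9) can be useless when h is too large; for larger n the confidence term shrinks roughly like the square root of h ln n / n.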

The bulk of this section will be devoted to explications of expressions like (7.9) and their consequences for inference in DMML. This begins with the definitions needed to understand the VC dimension h and continues with the hard work of understanding its properties and how they arise in bounds on the risk.

VC Dimension of F: Definitions

There are no less than three ways to approach defining the VC dimension. The most accessible is geometric, based on the idea of shattering a set of points. Shattering means that a collection of functions F, regarded as curves in a surface, or hypersurfaces in a space, can be used to partition a set of n points in all 2^n ways. The VC dimension is the largest number of points that can be shattered by F.

A separate approach is essentially combinatorial. The idea of shattering a point set by the use of functions is converted into a process of picking out points in a point set by the use of sets in a given collection of sets. The VC dimension arises as the exponent in a polynomial bound (whose existence is guaranteed by a result called Sauer's Lemma) on how many ways a set of n elements can be partitioned by using a collection of sets to pick out subsets of them.

The third way to define VC dimension is more technical. Undergirding much of the mathematics is the concept of a covering number. Given a set in a metric space, an ε-net is a collection of elements of the metric space so that each element in the set is at most ε away from some element in the collection. To be useful, an ε-net should be finite. The minimal cardinality of the ε-nets (for a fixed ε) for a certain set of vectors, defined from a set F of, say, indicator functions evaluated on n points, gives the covering number for that choice of ε and n. The covering number is, of course, related to F and to the number of different separations of collections of n points. The logarithm of the supremum of the (minimal) covering number over sets of n points is called the growth function. As a function of n it increases, but it can be proved that the qualitative behavior of the function changes from linear to logarithmic at a certain integer. This integer is, again, the VC dimension. The growth function represents how the information gain per extra point from a class of functions accumulates: past a certain point, given by h, the rate of information gain slows dramatically.

Geometric

Since the VC dimension h of a class of functions F depends on how they separate points, start by considering the two-class discrimination problem with a family F indexed by θ, say f(x, θ) ∈ {−1, 1}. Given a set of n points, there are 2^n subsets that can be regarded as arising from labeling the n points in all 2^n possible ways with 0, 1. Now, fix any one such labeling and suppose there is a θ so that f(x_i, θ) assigns 1 when x_i has the label 1 and −1 when x_i has the label 0. This means that f(·, θ) is a member of F that correctly assigns the labels. If for each of the 2^n labelings there is a member of F that can correctly assign those labels to the n points, then the set of points is shattered by F. The VC dimension of F is the maximum number of points that can be shattered by the elements of F. It is convenient to think of the points as data points, with n the sample size.

Now, the VC dimension of a set of indicator functions I_α(z), generated by, say, F, where α ∈ Λ indexes the domains on which I = 1, is the largest number h of points that can be separated into two different classes in all possible ways using that set of indicator functions. If there are n distinct points z_1, ..., z_n (in any configuration) in a fixed space that can be separated in all 2^n possible ways, then the VC dimension h is at least n. That is, it is enough to shatter one set of n vectors to show the dimension is at least n. If, for every value of n, there is a set of n vectors that can be shattered by the I(z, α)'s, then F has VC dimension infinity. So, to find the VC dimension of a collection of functions on a real space, one can test each n = 1, 2, 3, ... to find the first value of n at which no set of n points can be shattered, i.e., at which every configuration of n points has a labeling that cannot be replicated by the functions.

It is important to note that the definition of shattering is phrased in terms of all possible labelings of n vectors, as represented by the supports of the indicator functions, which lie in some fixed space. So, the VC dimension is for the set F whose elements define the supports of the indicator functions, not for the space itself. In a sense, the VC dimension is not of F itself so much as of the level sets defined by F, since they generate the indicator functions.

Indeed, the definition of VC dimension for a general set of functions F, not necessarily indicator functions, is obtained from the indicator functions of the level sets of F = {f_α(·) : α ∈ Λ}. Let f_α ∈ F be a real-valued function. Then the set of functions

I_{ {z : f_α(z) − β ≥ 0} }(z),   for α ∈ Λ and β ∈ ( inf_{z,α} f_α(z), sup_{z,α} f_α(z) ),   (7.10)

is the complete set of indicators for F. The VC dimension of F is then the maximal number h of vectors z_1, ..., z_h that can be shattered by the complete set of indicators of F, to which the earlier definition applies.

The geometric definition is often the easiest to use for examples, even though the other definitions are often easier for proving theorems. The intuition behind shattering will be clearer shortly, after a few examples are presented to show what it means in typical cases. It will also be seen later that when F consists of indicators, so that (7.8) holds with a = 0 and A = 1, results are much easier to prove and cleaner to state than for the case of a general class of real-valued functions.

Combinatorial

The definition of VC dimension can be built up from picking out subsets of a set, a concept related to shattering. This gives the VC dimension of a class of sets, which can then be extended to function classes. This was the original way that Vapnik and Chervonenkis (1971) expressed the idea. Let C be a collection of subsets of a space X, let {x_1, ..., x_n} be a set of n points in X, and let {C ∩ {x_1, ..., x_n} : C ∈ C} be the collection of subsets of {x_1, ..., x_n} picked out by C. Similar to before, the collection C shatters {x_1, ..., x_n} if and only if all 2^n of its subsets can be picked out by C, and the VC dimension h of the class C of sets is the smallest n for which no set of size n is shattered by C. Formally, van der Vaart and Wellner (1996) define

Δ_n(C, x_1, ..., x_n) = #{ {x_1, ..., x_n} ∩ C : C ∈ C },   (7.11a)

so that the VC dimension is

h = inf { n : max_{x_1,...,x_n} Δ_n(C, x_1, ..., x_n) < 2^n }.   (7.11b)

The infimum over an empty set can be defined to be infinity, so the VC dimension of a class of sets is infinite exactly when C shatters arbitrarily large sets.

Picking out points from a set is much like checking which indicator functions are 1 and inferring the location of a point by examining their supports. Essentially, the combinatorial interpretation of VC dimension is an effort to formalize the idea that finitely many data points only permit us to distinguish finitely many events. Intuitively, the idea is to use the data to form clusters of events. Consider the equivalence relation defined by saying that two events are in the same equivalence class if and only if they pick out the same data points. Then, the number of equivalence classes depends on the data set and the set of functions F. For fixed n, this is heuristically like finding a data-driven sub-σ-field from the σ-field generated by C.
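The picking-out definition can be checked by brute force for small point sets. The sketch below computes Δ_n(C, x_1, ..., x_n) of (7.11a) for the class C of left half-lines (−∞, a] on the real line (a class that reappears in the Examples subsection) and reports the smallest n with max Δ_n < 2^n, which is the h of (7.11b).

```python
# Brute-force computation of Delta_n(C, x_1,...,x_n) in (7.11a) for the class
# C = { (-inf, a] : a real } acting on points of the real line, and of the
# h in (7.11b).  The candidate point sets come from a small illustrative grid.
from itertools import combinations

def picked_out(points):
    """Distinct subsets of `points` of the form {x : x <= a}."""
    pts = sorted(points)
    subsets = {frozenset()}                      # a below all points picks out nothing
    for i in range(len(pts)):
        subsets.add(frozenset(pts[: i + 1]))     # a just at/above pts[i]
    return subsets

def max_delta(n, universe):
    return max(len(picked_out(c)) for c in combinations(universe, n))

universe = [0.0, 1.0, 2.0, 3.0, 4.0]
for n in (1, 2, 3):
    d = max_delta(n, universe)
    print(f"n={n}: max Delta_n = {d}, 2^n = {2**n}, shattered: {d == 2**n}")
# One point is shattered but two are not, so the h of (7.11b) is 2, consistent
# with the half-line example in the Examples subsection later in the chapter.
```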

Let f ∈ F, a class of functions. The subgraph of f : X → IR is

{(x, t) : t < f(x)},   (7.12)

and the collection F of measurable functions on a sample space is called a VC class if the collection of all subgraphs of the f's in F forms a class of sets with finite VC dimension. It is not too hard to see that (7.12) is equivalent to the geometric definition when F is a collection of indicator functions: draw the subgraph of a generic indicator function and observe that a set of points can be shattered in the geometric sense by indicator functions if and only if the corresponding subgraphs of the indicator functions shatter the set of points in the picking-out sense. More generally, results in van der Vaart and Wellner (1996, Sec. 2.6) suggest that (7.12) remains equivalent to (7.10): the indicators of the level sets for a function in (7.10) correspond to subsets of the subgraph of the function, so the subgraph of a function appears to hold all the shattering properties in the level sets when the VC dimension is finite, and conversely. (Problems 10 and 11 in van der Vaart and Wellner (1996) show how to partition the subgraph of a function in some cases.)

By definition, a VC class of sets picks out strictly fewer than 2^n subsets from any set of n ≥ h = VCdim points. Intuition from the behavior of continuous functions suggests that as the size of the point set a class of sets tries to shatter increases, the smaller the proportion of subsets the class can pick out. That is, if n ≥ h + 1, then even though as many as 2^n − 1 subsets could conceivably be picked out, the actual number is polynomial in n, not merely exponential with a smaller exponent. This surprising fact is called Sauer's Lemma; see Sauer (1972). It is of interest in its own right and will be important in the more technical definition of VC dimension in the next subsection. Let the shatter function for a class C be

τ(n) = max_{x_1,...,x_n} Δ_n(C, x_1, ..., x_n),   (7.12a)

the largest number of subsets of the x_i's one can obtain by intersecting {x_1, ..., x_n} with sets in C. If a set of size n is shattered by C, then VCdim(C) ≥ n and τ(n) = 2^n. If there is no such set, then VCdim(C) < n and τ(n) < 2^n.

Theorem (Sauer's Lemma): If VCdim(C) = h, then τ(n) ≤ n^h.

Proof: Following Ben-David (2003), see also van der Vaart and Wellner (1996, Sec. 2.6), if A is any finite set in X, then the condition

#{A ∩ C : C ∈ C} ≤ #{B ⊆ A : C shatters B}   (7.13a)

implies Sauer's Lemma, because VCdim(C) = h implies that

#{B ⊆ A : C shatters B} ≤ Σ_{i=0}^{h} C(#(A), i) ≤ #(A)^h.   (7.13b)

Expression (7.13a) follows by an induction argument on #(A). For #(A) = 0 both sides of (7.13a) are 1. If (7.13a) is true for all sets smaller than A and all collections of sets C, it is enough to prove (7.13a) for a fixed but arbitrary A and a general C.

Let x ∈ A and set A(x) = A \ {x}; A(x) is smaller than A, so the induction hypothesis can be applied to it. Now define three collections of sets. The first is the collection of subsets of A picked out by C:

C(A) = {C ∩ A : C ∈ C}.

The second is the collection of subsets of A(x) picked out by C:

C(A(x)) = {C ∩ A(x) : C ∈ C}.

The third is the collection of all subsets of A(x) that are in C(A) (and so do not contain x) but can be augmented by x to give another set in C(A):

C(A(x), aug) = {B ⊆ A(x) : B ∈ C(A) and B ∪ {x} ∈ C(A)}.

The first step is to show

#(C(A)) = #(C(A(x))) + #(C(A(x), aug)).   (7.14)

Let B ⊆ A(x). Then there are four possibilities for B and B ∪ {x} in relation to C(A). First, if B ∈ C(A) and B ∪ {x} ∈ C(A), then B is in C(A(x)) and in C(A(x), aug), while B and B ∪ {x} are two sets that contribute to C(A). So, both sides of (7.14) are affected equally. Second, if B ∈ C(A) but B ∪ {x} ∉ C(A), then B ∈ C(A(x)) but B ∉ C(A(x), aug). So, it contributes one set to the left-hand side of (7.14) and one set to the first term on the right-hand side of (7.14). Third, if B ∉ C(A) but B ∪ {x} ∈ C(A), then B ∪ {x} contributes one set to the left-hand side of (7.14). However, B ∪ {x} = C_x ∩ A for some C_x ∈ C, so

C_x ∩ A(x) = (C_x ∩ A) \ {x} = B ∈ C(A(x)),

which means it contributes one set to the first term on the right in (7.14). Finally, if B ∉ C(A) and B ∪ {x} ∉ C(A), then no set is contributed to any of the terms in (7.14). This exhausts the possibilities and establishes (7.14).

The second, and last, step in the proof is to apply the induction hypothesis to both sides of (7.14) and obtain (7.13a). Note that both collections of sets on the right in (7.14) contain only subsets of A(x), so C ∩ A(x) = C for each C in them. So, the two terms can be bounded as follows. First,

#(C(A(x))) = #{C ∩ A(x) : C ∈ C(A(x))}
           ≤ #{B ⊆ A(x) : C(A(x)) shatters B}
           = #{B ⊆ A \ {x} : C shatters B}
           = #{B ⊆ A : x ∉ B and C shatters B}.   (7.15a)

Similarly,

#(C(A(x), aug)) = #{C ∩ A(x) : C ∈ C(A(x), aug)}
              ≤ #{B ⊆ A(x) : C(A(x), aug) shatters B}
              = #{B ⊆ A \ {x} : C shatters B ∪ {x}}
              = #{B ⊆ A : x ∈ B and C shatters B}.   (7.15b)

Using (7.15a) and (7.15b) in (7.14) gives

#(C(A)) = #(C(A(x))) + #(C(A(x), aug)) ≤ #{B ⊆ A : C shatters B},

which is (7.13a), completing the induction. An extension of Sauer's Lemma, see van der Vaart and Wellner (1996), states that

τ(n) ≤ Σ_{j=0}^{h−1} C(n, j) ≤ ( ne/(h−1) )^{h−1}.   (7.13c)

Entropy Numbers

The goal is to assign a class of functions F a measure that generalizes the concept of real dimension so that a bound like (7.6) or (7.9) can be given. This will necessitate defining covering numbers, their finite-sample analogs called random entropy numbers, and then generalizations of entropy numbers called the annealed entropy and the growth function. The VC dimension will emerge from a key shape property of the growth function. This subsection is largely a revamping of parts of Vapnik (1998, Chap. 3).

To start, recall that the simplest measure of the size of a set is its cardinality, but since F is typically uncountable that is no help. What is a help is the concept of a covering number, since it too is a measure of size. Since F is a subset of a normed space with norm ||·||, given an ε one can measure F's size by the minimal number of balls of radius ε required to cover F. This is called the covering number N(ε, F, ||·||). The ε-entropy of F, or the Boltzmann entropy, is its natural log, ln N(ε, F, ||·||).

The covering number of a class of indicator functions F is related to the number of subsets that can be picked out from {x_1, ..., x_n} by the functions in F. Indeed, write F = {1_C : C ∈ C}, so that C is the class of sets whose indicator functions are in F. Now, a complicated argument based on Sauer's Lemma gives a bound on the covering number for F. The result is that there is a universal constant K so that for 0 < ε < 1 and r ≥ 1, if h is the (geometric) VC dimension of C, then

N(ε, F, L_r(µ)) = N(ε, C, L_r(µ)) ≤ K h (4e)^h ε^{−r(h−1)},

where L_r(µ) is the Lebesgue space with r-th power norm with respect to µ; see van der Vaart and Wellner (1996, Chap. 2.6).

The more typical definition of the entropy, also called the Shannon entropy, is minus the expected log of the density; this is the expectation of a sort of random entropy. The log density plays a role similar to the Boltzmann entropy because both represent a codelength. The log density is the codelength for identifying which X has occurred; the ε-entropy is the codelength for the number of configurations possible in a thermodynamic system. They are equal for uniform densities and, more generally, are related when the configurations can be grouped together to make a non-uniform distribution. The fourth section in this chapter will treat the Shannon entropy in detail.

To see informally how an ε-entropy might arise in ERM, observe that a variant on (7.7) is the uniform two-sided convergence condition

P( sup_{α∈Λ} | ∫ Q(z, α) dF(z) − (1/n) Σ_{i=1}^n Q(z_i, α) | > ε ) → 0   (7.16)

as n → ∞, and note that when, for each α, the Q(·, α)'s are indicator functions for sets C_α, the second term inside the absolute value is just the empirical probability of C_α under the empirical distribution ˆF_n from z_1, ..., z_n, defined through the functions in F. That is, (7.16) is

P( sup_{α∈Λ} | P({Z : Q(Z, α) > 0}) − ˆF_n({Z : Q(Z, α) > 0}) | > ε ) → 0,   (7.17)

in which the probability P outside the supremum in (7.17) applies to the randomness in Z_1, ..., Z_n. For a single fixed α, Hoeffding's inequality gives

P( | P({Z : Q(Z, α) > 0}) − ˆF_n({Z : Q(Z, α) > 0}) | > ε ) ≤ 2 e^{−2nε^2},   (7.18)

so the challenge is how to obtain uniformity over general sets of α's, i.e., over the whole set of events {z : Q(z, α) > 0} for α ∈ Λ.

As noted in Vapnik (1998), the case of finitely many α's in (7.17) can be handled as well. For α_1, ..., α_K,

P( max_{1≤k≤K} | P({Z : Q(Z, α_k) > 0}) − ˆF_n({Z : Q(Z, α_k) > 0}) | > ε )
  ≤ Σ_{k=1}^K P( | P({Z : Q(Z, α_k) > 0}) − ˆF_n({Z : Q(Z, α_k) > 0}) | > ε )
  ≤ 2K e^{−2nε^2} = 2 e^{n( ln K / n − 2ε^2 )}.   (7.19)

So, if K = K(n) and ln K/n → 0, it is reasonable to expect that (7.16) will hold. Moreover, if K is regarded as a number of ε-balls and (7.18) holds for a small neighborhood around each α at the centers of the ε-balls, then, if the ε is chosen small enough, (7.19) should hold for all α ∈ Λ, provided the function space is not too large in a VC dimension sense. That is, the ε-entropy is analogous to the ln K in (7.19). This is a metric variant of the usual compactness argument that extracts a finite open cover from a collection of balls, each of which has a certain desirable property. Here, however, covering F or Λ to control the convergence can be done in several different senses, and one of them will give a definition of the VC dimension.
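Before turning to the data-driven quantities, it may help to see the finite-class bound (7.19) numerically. The sketch below evaluates 2K e^{−2nε²} and the sample size needed to make it small; the choices of K, ε, and the target probability are illustrative assumptions.

```python
# Evaluate the finite-class uniform deviation bound (7.19):
#   P( max_k |P_k - F_hat_n,k| > eps ) <= 2 K exp(-2 n eps^2) = 2 exp(n (ln K / n - 2 eps^2)).
# The values of K, eps, and delta below are illustrative assumptions.
import math

def finite_class_bound(n, K, eps):
    return 2.0 * K * math.exp(-2.0 * n * eps * eps)

def sample_size_needed(K, eps, delta):
    """Smallest n making the bound at most delta."""
    return math.ceil((math.log(2.0 * K) - math.log(delta)) / (2.0 * eps * eps))

eps, delta = 0.05, 0.05
for K in (10, 1_000, 100_000):
    n = sample_size_needed(K, eps, delta)
    print(f"K={K}: need n >= {n}; bound at that n = {finite_class_bound(n, K, eps):.4f}")
# The required n grows only like ln K, which is why conditions of the form
# ln K(n)/n -> 0 (and, later, entropy/n -> 0) are the right scale.
```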

While the relationships among the ε-entropy of F, the VC dimension of F, and other quantities are deep, all of these are population quantities. By contrast, for ERM, the importance is on their data-driven analogs. So, given a finite sample, it is important to have an analog of the covering number. This is provided by counting the number of sets a sample of size n can distinguish. Define two events as distinguishable using a sample z_1, ..., z_n if and only if there is at least one z_i that belongs to one event and not the other. It is immediate that not all sets in a large class C are distinguishable given a sample; different samples can have different collections of distinguishable sets; and the number of distinguishable sets can depend on the sample as well.

The next task is to identify which sets it is important for a sample to be able to distinguish. For a set of indicator functions Q(z, α) with support sets C_α, α ∈ Λ, consider the vector of length n

q(α) = (Q(z_1, α), ..., Q(z_n, α)).   (7.20)

For fixed α, q(α) is a vertex of the unit cube in IR^n. As α ranges over Λ, q(α) hops from vertex to vertex. Let N(Λ, z_1, ..., z_n) be the number of vertices q(α) lands on. Then N(Λ, z_1, ..., z_n) is the number of sets the sample z_1, ..., z_n can distinguish. In fact, N(Λ, z_1, ..., z_n) is the same as Δ_n(C, z_1, ..., z_n) if C is the collection of sets that can be the supports of the indicator functions Q_α. Let the Z_i be IID P. Then N(Λ, Z_1, ..., Z_n) is a random variable bounded by 2^n for given n. The random entropy of the set F of indicator functions Q(·, α), α ∈ Λ, is ln N(Λ, z_1, ..., z_n), and the entropy of F is

H(Λ, n) = ∫ ln N(Λ, z_1, ..., z_n) dF(z_1, ..., z_n).   (7.21)

The idea will be to use N(Λ, z_1, ..., z_n) or H(Λ, n) in place of K in (7.19). It will be possible, in the next section, to show that if H(Λ, n) increases slowly compared to n, then uniform convergence over F can be obtained.

The entropy of real-valued functions in general is similar to the indicator function case, but uses the concept of a covering number more delicately. First consider a set F_C of bounded real-valued functions defined by requiring |Q(z, α)| ≤ C for all α ∈ Λ. Again, q(α) = q(α, z_1, ..., z_n) can be used to generate a region R in the n-dimensional cube of side length 2C. The random covering number of R for a given ε is N(Λ, ε, z_1, ..., z_n), giving ln N(Λ, ε, z_1, ..., z_n) as the random ε-entropy. (To define the covering number, the norm must be specified as well as the ε and the region to be covered. In this case, the natural choice is the metric ρ(q(α), q(α′)) = max_{i=1,...,n} |Q(z_i, α) − Q(z_i, α′)|.) The ε-entropy (not random) for F_C is

H(Λ, ε, n) = ∫ ln N(Λ, ε, z_1, ..., z_n) dF(z_1, ..., z_n).

Note that if F_C consists of indicator functions, then for ε ∈ (0, 1), N(Λ, ε, z_1, ..., z_n) is independent of ε and N(Λ, ε, z_1, ..., z_n) = N(Λ, z_1, ..., z_n), so the definition of the random covering number for bounded functions reduces to that for indicator functions. This holds for the random entropy too, and gives H(Λ, ε, n) = H(Λ, n).

For a general class F of functions it is unclear how to define the entropy. However, here it will not be needed. It will be enough to impose the moment condition

−∞ < a ≤ ∫ Q(z, α) dF(z) ≤ A < ∞   (7.22)

for all α ∈ Λ, the envelope condition

∫ sup_{α∈Λ} |Q(z, α)| dF(z) < ∞,   (7.23)

and the truncation condition that the class F_C formed by truncating each f ∈ F from above at C and from below at −C has finite entropy.

Equipped with the entropy, it is easy to specify two closely related quantities that will arise in the bounds to be established involving the VC dimension. First, the annealed entropy is

H_ann(Λ, n) = ln E N(Λ, Z_1, ..., Z_n)   (7.24a)

and, second, the growth function is

G(Λ, n) = ln sup_{z_1,...,z_n} N(Λ, z_1, ..., z_n).   (7.24b)

Clearly, H(Λ, n) ≤ H_ann(Λ, n) ≤ G(Λ, n), the first inequality being the result of Jensen's inequality applied to the entropy H(Λ, n) and the second of taking the supremum of the integrand. The growth function does not depend on the probability measure; it is purely data driven. So, an ERM bound using G(Λ, n) may be preferred over entropy or annealed entropy bounds. Clearly, H_ann(Λ, n) ≤ G(Λ, n) implies that any bound using the annealed entropy gives a bound in terms of the growth function. On the other hand, the three conditions

lim_{n→∞} H(Λ, n)/n = 0,   lim_{n→∞} H_ann(Λ, n)/n = 0,   lim_{n→∞} G(Λ, n)/n = 0

are increasingly stringent and will be seen to give greater and greater control on bounds of the form (7.19) that undergird ERM.

The key theorem linking the VC dimension to a collection of indicator functions is the following. It provides an interpretation for h separate from its definition in terms of shattering or (7.11b). Although h arises in the argument by using the geometric definition of shattering, in principle there is no reason not to use the result of this Theorem as a third definition for h. Let C(n, i) be the number of combinations of i items from n items.

Theorem (Vapnik, 1998): The growth function G(Λ, n) for a set of indicator functions Q(z, α), α ∈ Λ, is linear-logarithmic in the sense that

G(Λ, n) = n ln 2,                                                  if n ≤ h,
G(Λ, n) ≤ ln( Σ_{i=0}^{h} C(n, i) ) ≤ h ln(en/h) = h(1 + ln(n/h)),   if n > h.

Comment: If h = ∞, then the second case in the bound never applies, so G(Λ, n) = n ln 2 for all n.

The essence of the result is that if a set of functions has finite VC dimension, then its growth function is initially linear, meaning that the amount of learning accumulated is proportional to the sample size up to the VC dimension, after which the learning per datum drops off to a logarithmic rate.

Outline of Proof: The strategy is to examine the relationship between N(Λ, z_1, ..., z_n) and sums of combinations, for they lead to the expression in the theorem. This is characterized by three Facts whose (nontrivial) proofs are omitted.

Fact 1: Suppose z_1, ..., z_n satisfies N(Λ, z_1, ..., z_n) > Σ_{i=0}^{m−1} C(n, i) for some m. Then there is a subsequence z*_1, ..., z*_m of length m so that N(Λ, z*_1, ..., z*_m) = 2^m.

The proof of Fact 1 is an elaborate induction argument on n and m.

Fact 2: Suppose that for some m,

sup_{z_1,...,z_{m+1}} N(Λ, z_1, ..., z_{m+1}) < 2^{m+1}.

Then, for n > m,

sup_{z_1,...,z_n} N(Λ, z_1, ..., z_n) ≤ Φ(m + 1, n),

where Φ(m, n) = Σ_{i=0}^{m−1} C(n, i).

Fact 3: For n > m,

Φ(m, n) < 1.5 n^{m−1}/(m−1)! < ( en/(m−1) )^{m−1}.

Now, put the three Facts together. Since sup_{z_1,...,z_{m+1}} N(Λ, z_1, ..., z_{m+1}) ≤ 2^{m+1} always, whenever there is strict inequality Fact 2 can be applied. Thus, as soon as there is any compression, in the sense that the vectors q(α) hit fewer than the maximum number of vertices, Facts 2 and 3 can be combined to get

sup_{z_1,...,z_n} N(Λ, z_1, ..., z_n) ≤ Φ(m + 1, n) < 1.5 n^m / m! < ( en/m )^m.

Taking logs gives the logarithmic part of the bound in terms of m. That is, as soon as the number of vertices hit by the q(α)'s falls below the maximal number, as defined by the partitions of the data z_1^n generated from F by way of the indicator functions Q(z, α) = L[f(x, α), y], the logarithmic term appears. Note that once the reduction has occurred for one n, it occurs for every n afterward. The logarithmic bound begins to apply at the smallest m permitting the reduction, and this arises when the functions in F can no longer shatter the data, i.e., when the data points can no longer be separated in all possible ways by the indicator functions defined by the functions in F.

This Theorem can be restated in Sauer's Lemma terms as follows. In this sense, the combinatorial definition of h, see (7.11b), leads to the same interpretation of h as in the last Theorem: h is the changepoint where the rate of increase of a function changes from linear in n to logarithmic in n. Note that Δ_n plays the same role as the growth function; they differ mostly by the use of the logarithm.

Theorem: Let C be a class of subsets. Then either

sup_{z_1,...,z_n} Δ_n(C, z_1, ..., z_n) = 2^n   for all n,

or

sup_{z_1,...,z_n} Δ_n(C, z_1, ..., z_n) = 2^n                                if n ≤ h,
sup_{z_1,...,z_n} Δ_n(C, z_1, ..., z_n) ≤ Σ_{i=0}^{h} C(n, i) ≤ (en/h)^h       if n > h,

where h is the last integer n for which the equality in the first case is valid.

Proof: This is a restatement of the last Theorem using the Facts stated there.
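The change from linear to logarithmic behavior is easy to see numerically. The sketch below compares n ln 2, ln Σ_{i=0}^{h} C(n, i), and h(1 + ln(n/h)) from the last two theorems; the choice h = 5 is arbitrary.

```python
# Numerical comparison of the growth-function bounds: for n <= h the value is
# n*ln(2) (all separations possible), while for n > h the bound is
# ln(sum_{i=0}^{h} C(n, i)) <= h*(1 + ln(n/h)).  The choice h = 5 is arbitrary.
import math

def log_sauer(n, h):
    return math.log(sum(math.comb(n, i) for i in range(h + 1)))

h = 5
print(" n    n*ln2    ln(sum C(n,i))    h(1+ln(n/h))")
for n in (2, 5, 10, 50, 200, 1000):
    linear = n * math.log(2.0)
    if n <= h:
        print(f"{n:4d}  {linear:7.2f}   (= growth function when n <= h)")
    else:
        print(f"{n:4d}  {linear:7.2f}   {log_sauer(n, h):12.2f}   {h*(1+math.log(n/h)):12.2f}")
```

Even at n = 1000 the logarithmic bound is about 30 while n ln 2 is nearly 700, which is the "compression" the outline of proof describes.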

Note that the combinatorial definition of h is implicit in the use of Δ_n, but that the bound can be taken as a definition of h. This means the bound gives a definition of h which is de facto equivalent to the combinatorial definition, which was already seen to be equivalent to the geometric definition by shattering.

For completeness, it is important to give explicit definitions of the annealed entropy and growth function for a general set of real-valued functions; this is the case most likely to appear. The entropy and VC dimension have already been defined for this case, so recall that a set of indicator functions Q_α can be used to define the entropy as in (7.20) and (7.21), and suppose the Q_α's arise from a collection of real-valued functions f_α as in (7.10). Let N_β(Λ, z_1^n) be the number of different separations of the n vectors by the complete set of indicators for the Q(z, α)'s; the subscript β indicates the use of the level sets in (7.10). The random entropy of the set of real-valued functions Q(z, α) is now H_β(Λ, z_1^n) = ln N_β(Λ, z_1^n), and the annealed entropy of the Q(z, α)'s is

H_ann,β(Λ, n) = ln E N_β(Λ, Z_1^n).   (7.25a)

The growth function of the Q(z, α)'s is

G_β(Λ, n) = ln max_{z_1^n} N_β(Λ, z_1^n),   (7.25b)

and the VC dimension of the real-valued functions Q(z, α), α ∈ Λ, is the largest number h of vectors z_1, ..., z_h that can be shattered by the complete set of indicators I(Q(z, α) − β ≥ 0) for α ∈ Λ and β in the interval in (7.10). This is the same as before, except that the dependence on the level sets of the f_α's has been indicated. Indeed, as with (7.21) and (7.24a,b),

H_ann,β(Λ, n) ≤ G_β(Λ, n) ≤ h(ln(n/h) + 1).

Examples

To get a sense of how the definition of shattering leads to a concept of dimension, it is worth seeing that the VC dimension often reduces to simple expressions that are minor modifications of the conventional dimension in real spaces. An amusing first observation is that the collection of indicator functions on IR with support (−∞, a], for a ∈ IR, has VC dimension 2 because it cannot pick out the larger of two points x_1 and x_2. However, the collection of indicator functions on IR with support (a, b], for a, b ∈ IR, has VC dimension 3 because it cannot pick out the largest and smallest of three points x_1, x_2, and x_3. The natural extensions of these sets to IR^d have VC dimensions d + 1 and 2d + 1.
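These two claims can be verified by brute force. The sketch below checks the interval case (a, b], using the chapter's convention from (7.11b) in which h is the smallest number of points that cannot be shattered.

```python
# Brute-force check of the interval example: indicator functions with support
# (a, b] on the real line.  Two points can be shattered, but for three points
# the labeling {smallest, largest} without the middle cannot be picked out,
# so the smallest non-shattered size -- the h of (7.11b) -- is 3.
from itertools import product

def interval_pick(points, a, b):
    return frozenset(x for x in points if a < x <= b)

def shatterable(points):
    pts = sorted(points)
    cuts = [min(pts) - 1.0] + pts + [max(pts) + 1.0]     # enough endpoints to try
    achieved = {interval_pick(pts, a, b) for a, b in product(cuts, cuts)}
    return len(achieved) == 2 ** len(pts)

pts = [0.0, 1.0, 2.0]
print("two points shattered:  ", shatterable(pts[:2]))   # True
print("three points shattered:", shatterable(pts))       # False
```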

Now, consider planes through the origin in IR^n. That is, let F be the set of functions of the form f_θ(x) = θ·x = Σ_{i=1}^n θ_i x_i for θ = (θ_1, ..., θ_n) and x = (x_1, ..., x_n). The task is to determine the highest number of points that can be shattered by F. It will be seen that the answer depends on the range of x.

First, suppose that x varies over all of IR^n. Then the VC dimension of F is n + 1. To see this, recall that the shattering definition requires thinking in terms of partitioning point sets by indicator functions. So, associate to any f_θ the indicator function I_{{x : f_θ(x) > 0}}(x), which is 1 when f_θ > 0 and zero otherwise. This is the same as saying the points on one side of the hyperplane f_θ(x) = 0 are coded 1 and the others 0. (A minus sign gives the reverse coding.) Now ask: how many points in IR^n must accumulate before they can no longer be partitioned in all possible ways? More formally, if there are k points, how large must k be before the number of ways the points can be partitioned by indicator functions I_{{x : f_θ(x) > 0}}(x) falls below 2^k? Sauer's Lemma and the last theorem guarantee that such a k exists; the question is to find it for a given n.

One way to proceed is to start with n = 2 and test values of k. So, consider k = 1 point in IR^2. There are 2 ways to label the point, 0 and 1, and the two cases are symmetric. The class of indicator functions obtained from F is {I_{{x : f_θ(x) > 0}}(x)}. Given any labeling of the point by 0 or 1, some f ∈ F gives that labeling and −f gives the other. So, the VC dimension is at least 1. Next, consider 2 points in IR^2: there are four ways to label the two points with 0 and 1. Suppose the two points do not lie on a common line through the origin unless one of them is the origin. It is easy to find a line through the origin so that both points are on the side of it that gets 1, or the side that gets 0. And as long as the two points are not on a common line through the origin (and are distinct from the origin), there will be a line through the origin so that one of the points is on the side that gets 1 and the other is on the side that gets 0. So, there are infinitely many pairs of points that can be shattered; picking one means the VC dimension is at least 2.

Now, consider 3 points in IR^2. To get a VC dimension of at least three, it is enough to find 3 points that can be shattered. If none of the points is the origin, typically they cannot be shattered. However, if one of the points is the origin and the other two are not collinear with the origin, then the three points can be shattered by the indicator functions. So, the VC dimension is at least three. In fact, in this case, the VC dimension cannot be 4 or higher: there is no configuration of 4 points, even if one is at the origin, that can be shattered by planes through the origin; this can be seen by a modification of the proof of the next theorem. If n = 3, then the same kind of argument produces 4 points that can be shattered (one of them at the origin) and, again, the next proof can be modified to show that 4 is the maximal number of points that can be shattered. Higher values of n are also covered by the theorem and establish that VCdim(F) = n + 1.

The real dimension of the class of indicator functions I_{{x : f_θ(x) > 0}}(x) for f ∈ F is, however, n, not n + 1. The discrepancy is cleared up by looking closely at the role of the origin. It is uniquely easy to separate from any other point because it is always on the boundary of the support of an indicator function. As a consequence, the linear independence of the components x_i of x is indeterminate when the values of the x_i's are zero. Thus, if the origin is removed from IR^n, so that the new domain of the functions in F is IR^n \ {0}, the VC dimension becomes n.

To bring this out clearly in general, redefine the class to be the set of indicator functions

F′ = { I_{{x : Σ_{i=1}^n θ_i φ_i(x) ≥ 0}} : θ_1, ..., θ_n ∈ IR },

where the φ_i's are a linearly independent collection of n functions. If there is an x_0 where all the φ_i(x)'s are zero, without loss of generality assume x_0 = 0 by translation. If all φ_i(x) = x_i, then F′ reduces to F, which consisted of planes through the origin.


More information

Computational Learning Theory

Computational Learning Theory CS 446 Machine Learning Fall 2016 OCT 11, 2016 Computational Learning Theory Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes 1 PAC Learning We want to develop a theory to relate the probability of successful

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Testing Problems with Sub-Learning Sample Complexity

Testing Problems with Sub-Learning Sample Complexity Testing Problems with Sub-Learning Sample Complexity Michael Kearns AT&T Labs Research 180 Park Avenue Florham Park, NJ, 07932 mkearns@researchattcom Dana Ron Laboratory for Computer Science, MIT 545 Technology

More information

Class 2 & 3 Overfitting & Regularization

Class 2 & 3 Overfitting & Regularization Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

PAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht

PAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht PAC Learning prof. dr Arno Siebes Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht Recall: PAC Learning (Version 1) A hypothesis class H is PAC learnable

More information

The PAC Learning Framework -II

The PAC Learning Framework -II The PAC Learning Framework -II Prof. Dan A. Simovici UMB 1 / 1 Outline 1 Finite Hypothesis Space - The Inconsistent Case 2 Deterministic versus stochastic scenario 3 Bayes Error and Noise 2 / 1 Outline

More information

Lebesgue Measure on R n

Lebesgue Measure on R n CHAPTER 2 Lebesgue Measure on R n Our goal is to construct a notion of the volume, or Lebesgue measure, of rather general subsets of R n that reduces to the usual volume of elementary geometrical sets

More information

Linear regression COMS 4771

Linear regression COMS 4771 Linear regression COMS 4771 1. Old Faithful and prediction functions Prediction problem: Old Faithful geyser (Yellowstone) Task: Predict time of next eruption. 1 / 40 Statistical model for time between

More information

Statistical and Computational Learning Theory

Statistical and Computational Learning Theory Statistical and Computational Learning Theory Fundamental Question: Predict Error Rates Given: Find: The space H of hypotheses The number and distribution of the training examples S The complexity of the

More information

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă mmp@stat.washington.edu Reading: Murphy: BIC, AIC 8.4.2 (pp 255), SRM 6.5 (pp 204) Hastie, Tibshirani

More information

Generalization bounds

Generalization bounds Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question

More information

Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;

Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity; CSCI699: Topics in Learning and Game Theory Lecture 2 Lecturer: Ilias Diakonikolas Scribes: Li Han Today we will cover the following 2 topics: 1. Learning infinite hypothesis class via VC-dimension and

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Decision Tree Learning Lecture 2

Decision Tree Learning Lecture 2 Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

More information

Some Background Material

Some Background Material Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric

More information

12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016

12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016 12. Structural Risk Minimization ECE 830 & CS 761, Spring 2016 1 / 23 General setup for statistical learning theory We observe training examples {x i, y i } n i=1 x i = features X y i = labels / responses

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning PAC Learning and VC Dimension Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE

More information

Online Learning, Mistake Bounds, Perceptron Algorithm

Online Learning, Mistake Bounds, Perceptron Algorithm Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which

More information

VC-DENSITY FOR TREES

VC-DENSITY FOR TREES VC-DENSITY FOR TREES ANTON BOBKOV Abstract. We show that for the theory of infinite trees we have vc(n) = n for all n. VC density was introduced in [1] by Aschenbrenner, Dolich, Haskell, MacPherson, and

More information

THE VAPNIK- CHERVONENKIS DIMENSION and LEARNABILITY

THE VAPNIK- CHERVONENKIS DIMENSION and LEARNABILITY THE VAPNIK- CHERVONENKIS DIMENSION and LEARNABILITY Dan A. Simovici UMB, Doctoral Summer School Iasi, Romania What is Machine Learning? The Vapnik-Chervonenkis Dimension Probabilistic Learning Potential

More information

Introduction: The Perceptron

Introduction: The Perceptron Introduction: The Perceptron Haim Sompolinsy, MIT October 4, 203 Perceptron Architecture The simplest type of perceptron has a single layer of weights connecting the inputs and output. Formally, the perceptron

More information

Lecture 8. Instructor: Haipeng Luo

Lecture 8. Instructor: Haipeng Luo Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine

More information

Bennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence

Bennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence Bennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence Chao Zhang The Biodesign Institute Arizona State University Tempe, AZ 8587, USA Abstract In this paper, we present

More information

An Introduction to Statistical Machine Learning - Theoretical Aspects -

An Introduction to Statistical Machine Learning - Theoretical Aspects - An Introduction to Statistical Machine Learning - Theoretical Aspects - Samy Bengio bengio@idiap.ch Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) CP 592, rue du Simplon 4 1920 Martigny,

More information

The Vapnik-Chervonenkis Dimension

The Vapnik-Chervonenkis Dimension The Vapnik-Chervonenkis Dimension Prof. Dan A. Simovici UMB 1 / 91 Outline 1 Growth Functions 2 Basic Definitions for Vapnik-Chervonenkis Dimension 3 The Sauer-Shelah Theorem 4 The Link between VCD and

More information

CONSIDER a measurable space and a probability

CONSIDER a measurable space and a probability 1682 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL 46, NO 11, NOVEMBER 2001 Learning With Prior Information M C Campi and M Vidyasagar, Fellow, IEEE Abstract In this paper, a new notion of learnability is

More information

We are going to discuss what it means for a sequence to converge in three stages: First, we define what it means for a sequence to converge to zero

We are going to discuss what it means for a sequence to converge in three stages: First, we define what it means for a sequence to converge to zero Chapter Limits of Sequences Calculus Student: lim s n = 0 means the s n are getting closer and closer to zero but never gets there. Instructor: ARGHHHHH! Exercise. Think of a better response for the instructor.

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

CS264: Beyond Worst-Case Analysis Lecture #14: Smoothed Analysis of Pareto Curves

CS264: Beyond Worst-Case Analysis Lecture #14: Smoothed Analysis of Pareto Curves CS264: Beyond Worst-Case Analysis Lecture #14: Smoothed Analysis of Pareto Curves Tim Roughgarden November 5, 2014 1 Pareto Curves and a Knapsack Algorithm Our next application of smoothed analysis is

More information

Computational Learning Theory. CS534 - Machine Learning

Computational Learning Theory. CS534 - Machine Learning Computational Learning Theory CS534 Machine Learning Introduction Computational learning theory Provides a theoretical analysis of learning Shows when a learning algorithm can be expected to succeed Shows

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU10701 11. Learning Theory Barnabás Póczos Learning Theory We have explored many ways of learning from data But How good is our classifier, really? How much data do we

More information

An introduction to basic information theory. Hampus Wessman

An introduction to basic information theory. Hampus Wessman An introduction to basic information theory Hampus Wessman Abstract We give a short and simple introduction to basic information theory, by stripping away all the non-essentials. Theoretical bounds on

More information

A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997

A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997 A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997 Vasant Honavar Artificial Intelligence Research Laboratory Department of Computer Science

More information

Consequences of Continuity

Consequences of Continuity Consequences of Continuity James K. Peterson Department of Biological Sciences and Department of Mathematical Sciences Clemson University October 4, 2017 Outline 1 Domains of Continuous Functions 2 The

More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Web-Mining Agents Computational Learning Theory

Web-Mining Agents Computational Learning Theory Web-Mining Agents Computational Learning Theory Prof. Dr. Ralf Möller Dr. Özgür Özcep Universität zu Lübeck Institut für Informationssysteme Tanya Braun (Exercise Lab) Computational Learning Theory (Adapted)

More information

Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh

Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh Generalization Bounds in Machine Learning Presented by: Afshin Rostamizadeh Outline Introduction to generalization bounds. Examples: VC-bounds Covering Number bounds Rademacher bounds Stability bounds

More information

Set, functions and Euclidean space. Seungjin Han

Set, functions and Euclidean space. Seungjin Han Set, functions and Euclidean space Seungjin Han September, 2018 1 Some Basics LOGIC A is necessary for B : If B holds, then A holds. B A A B is the contraposition of B A. A is sufficient for B: If A holds,

More information

Sample width for multi-category classifiers

Sample width for multi-category classifiers R u t c o r Research R e p o r t Sample width for multi-category classifiers Martin Anthony a Joel Ratsaby b RRR 29-2012, November 2012 RUTCOR Rutgers Center for Operations Research Rutgers University

More information

Notes on Complex Analysis

Notes on Complex Analysis Michael Papadimitrakis Notes on Complex Analysis Department of Mathematics University of Crete Contents The complex plane.. The complex plane...................................2 Argument and polar representation.........................

More information

12 Statistical Justifications; the Bias-Variance Decomposition

12 Statistical Justifications; the Bias-Variance Decomposition Statistical Justifications; the Bias-Variance Decomposition 65 12 Statistical Justifications; the Bias-Variance Decomposition STATISTICAL JUSTIFICATIONS FOR REGRESSION [So far, I ve talked about regression

More information

Fast learning rates for plug-in classifiers under the margin condition

Fast learning rates for plug-in classifiers under the margin condition Fast learning rates for plug-in classifiers under the margin condition Jean-Yves Audibert 1 Alexandre B. Tsybakov 2 1 Certis ParisTech - Ecole des Ponts, France 2 LPMA Université Pierre et Marie Curie,

More information

A Bound on the Label Complexity of Agnostic Active Learning

A Bound on the Label Complexity of Agnostic Active Learning A Bound on the Label Complexity of Agnostic Active Learning Steve Hanneke March 2007 CMU-ML-07-103 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Machine Learning Department,

More information

Statistical Learning Learning From Examples

Statistical Learning Learning From Examples Statistical Learning Learning From Examples We want to estimate the working temperature range of an iphone. We could study the physics and chemistry that affect the performance of the phone too hard We

More information

Existence and Uniqueness

Existence and Uniqueness Chapter 3 Existence and Uniqueness An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect

More information

Probability and Measure

Probability and Measure Probability and Measure Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Convergence of Random Variables 1. Convergence Concepts 1.1. Convergence of Real

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Consistency of Nearest Neighbor Methods

Consistency of Nearest Neighbor Methods E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Model Selection and Geometry

Model Selection and Geometry Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model

More information

Probably Approximately Correct (PAC) Learning

Probably Approximately Correct (PAC) Learning ECE91 Spring 24 Statistical Regularization and Learning Theory Lecture: 6 Probably Approximately Correct (PAC) Learning Lecturer: Rob Nowak Scribe: Badri Narayan 1 Introduction 1.1 Overview of the Learning

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

MORE ON CONTINUOUS FUNCTIONS AND SETS

MORE ON CONTINUOUS FUNCTIONS AND SETS Chapter 6 MORE ON CONTINUOUS FUNCTIONS AND SETS This chapter can be considered enrichment material containing also several more advanced topics and may be skipped in its entirety. You can proceed directly

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18 CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

CS6375: Machine Learning Gautam Kunapuli. Decision Trees Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

Topological properties

Topological properties CHAPTER 4 Topological properties 1. Connectedness Definitions and examples Basic properties Connected components Connected versus path connected, again 2. Compactness Definition and first examples Topological

More information

MA103 Introduction to Abstract Mathematics Second part, Analysis and Algebra

MA103 Introduction to Abstract Mathematics Second part, Analysis and Algebra 206/7 MA03 Introduction to Abstract Mathematics Second part, Analysis and Algebra Amol Sasane Revised by Jozef Skokan, Konrad Swanepoel, and Graham Brightwell Copyright c London School of Economics 206

More information

Machine Learning. Computational Learning Theory. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Machine Learning. Computational Learning Theory. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012 Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Computational Learning Theory Le Song Lecture 11, September 20, 2012 Based on Slides from Eric Xing, CMU Reading: Chap. 7 T.M book 1 Complexity of Learning

More information

Bias-Variance Tradeoff

Bias-Variance Tradeoff What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff

More information

Machine Learning. Computational Learning Theory. Eric Xing , Fall Lecture 9, October 5, 2016

Machine Learning. Computational Learning Theory. Eric Xing , Fall Lecture 9, October 5, 2016 Machine Learning 10-701, Fall 2016 Computational Learning Theory Eric Xing Lecture 9, October 5, 2016 Reading: Chap. 7 T.M book Eric Xing @ CMU, 2006-2016 1 Generalizability of Learning In machine learning

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Introduction to Machine Learning (67577) Lecture 3

Introduction to Machine Learning (67577) Lecture 3 Introduction to Machine Learning (67577) Lecture 3 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem General Learning Model and Bias-Complexity tradeoff Shai Shalev-Shwartz

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017 Non-Convex Optimization CS6787 Lecture 7 Fall 2017 First some words about grading I sent out a bunch of grades on the course management system Everyone should have all their grades in Not including paper

More information

Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts arxiv: v2 [math.st] 23 Jul 2012

Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts arxiv: v2 [math.st] 23 Jul 2012 Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts arxiv:203.093v2 [math.st] 23 Jul 202 Servane Gey July 24, 202 Abstract The Vapnik-Chervonenkis (VC) dimension of the set of half-spaces of R d with frontiers

More information