Sample width for multi-category classifiers

RUTCOR Research Report

Sample width for multi-category classifiers

Martin Anthony^a, Joel Ratsaby^b

RRR, November 2012

RUTCOR, Rutgers Center for Operations Research, Rutgers University, 640 Bartholomew Road, Piscataway, New Jersey. Email: rrr@rutcor.rutgers.edu

^a Department of Mathematics, The London School of Economics and Political Science, Houghton Street, London WC2A 2AE, UK. m.anthony@lse.ac.uk

^b Electrical and Electronics Engineering Department, Ariel University Center of Samaria, Ariel 40700, Israel. ratsaby@ariel.ac.il

Abstract. In a recent paper, the authors introduced the notion of sample width for binary classifiers defined on the set of real numbers. It was shown that the performance of such classifiers could be quantified in terms of this sample width. This paper considers how to adapt the idea of sample width so that it can be applied in cases where the classifiers are multi-category and are defined on some arbitrary metric space.

Acknowledgements: This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST.

1 Introduction

By a (multi-category) classifier on a set X, we mean a function mapping from X to [C] = {1, 2, ..., C}, where C ≥ 2 is the number of possible categories. Such a classifier indicates to which of the C different classes objects from X belong and, in supervised machine learning, it is arrived at on the basis of a sample, a set of objects from X together with their classifications in [C]. In [4], the notion of sample width for binary classifiers (C = 2) mapping from the real line X = R was introduced, and in [5] this was generalized to finite metric spaces. In this paper, we consider how a similar approach might be taken to the situation in which C could be larger than 2, and in which the classifiers map not simply from the real line, but from some metric space (which would not generally have the linear structure of the real line). The definition of sample width is given below, but it is possible to indicate the basic idea at this stage: we define the sample width to be at least γ if the classifier achieves the correct classifications on the sample and, furthermore, for each sample point, the minimum distance to a point of the domain having a different classification is at least γ.

A key issue that arises in machine learning is that of generalization error: given that a classifier has been produced by some learning algorithm on the basis of a (random) sample of a certain size, how can we quantify the accuracy of that classifier, where by its accuracy we mean its likely performance in classifying objects from X correctly? In this paper, we seek answers to this question that involve not just the sample size, but the sample width.

2 Probabilistic modelling of learning

We work in a version of the popular PAC framework of computational learning theory (see [14, 7]). This model assumes that the sample s consists of an ordered set of labeled examples (x_i, y_i), where x_i ∈ X and y_i ∈ Y = [C], and that each (x_i, y_i) in the training sample s has been generated randomly according to some fixed (but unknown) probability distribution P on Z = X × Y. (This includes, as a special case, the situation in which each x_i is drawn according to a fixed distribution on X and is then labeled deterministically by y_i = t(x_i), where t is some fixed function.) Thus, a sample s of length m can be thought of as being drawn randomly according to the product probability distribution P^m. An appropriate measure of how well h : X → Y would perform on further randomly drawn points is its error, er_P(h), the probability that h(X) ≠ Y for random (X, Y). Given any function h ∈ H, we can measure how well h matches the training sample through its sample error

er_s(h) = (1/m) |{i : h(x_i) ≠ y_i}|,

the proportion of points in the sample incorrectly classified by h.
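To make the definition concrete, here is a minimal Python sketch (not from the paper): it computes er_s(h) for a hypothetical three-class classifier on the real line; the classifier h and the sample s below are invented purely for illustration.

```python
# Minimal sketch (not from the paper): the empirical sample error er_s(h),
# i.e. the fraction of labelled examples that the classifier h gets wrong.
from typing import Callable, Sequence, Tuple

def sample_error(h: Callable[[float], int],
                 sample: Sequence[Tuple[float, int]]) -> float:
    """Return er_s(h) = (1/m) |{i : h(x_i) != y_i}|."""
    m = len(sample)
    return sum(1 for x, y in sample if h(x) != y) / m

# Hypothetical example: a three-class classifier on the real line.
h = lambda x: 1 if x < 0 else (2 if x < 5 else 3)
s = [(-2.0, 1), (1.0, 2), (4.0, 3), (7.0, 3)]   # the third point is misclassified
print(sample_error(h, s))                        # 0.25
```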

Much classical work in learning theory (see [7, 14], for instance) related the error of a classifier h to its sample error. A typical result would state that, for all δ ∈ (0, 1), with probability at least 1 − δ, for all h belonging to some specified set of functions, we have er_P(h) < er_s(h) + ε(m, δ), where ε(m, δ) (known as a generalization error bound) is decreasing in m and δ. Such results can be derived using uniform convergence theorems from probability theory [15, 11, 8], in which case ε(m, δ) would typically involve a quantity known as the growth function of the set of classifiers [15, 7, 14, 2]. More recently, emphasis has been placed on learning with a large margin. (See, for instance, [13, 2, 1, 12].) The rationale behind margin-based generalization error bounds in the two-category classification case is that if a binary classifier can be thought of as a geometrical separator between points, and if it has managed to achieve a wide separation between the points of different classification, then this indicates that it is a good classifier, and it is possible that a better generalization error bound can be obtained. Margin-based results apply when the binary classifiers are derived from real-valued functions by thresholding (taking their sign). Margin analysis has been extended to multi-category classifiers in [9].

3 The width of a classifier

We now discuss the case where the underlying set of objects X forms a metric space. Let X be a set on which is defined a metric d : X × X → R. For a subset S of X, define the distance d(x, S) from x ∈ X to S as follows:

d(x, S) := inf_{y ∈ S} d(x, y).

We define the diameter of X to be

diam(X) := sup_{x, y ∈ X} d(x, y).

We will denote by H the set of all possible functions h from X to [C].

The paper [4] introduced the notion of the width of a binary classifier at a point in the domain, in the case where the domain was the real line R. Consider a set of points {x_1, x_2, ..., x_m} from R which, together with their true classifications y_i ∈ {−1, 1}, yield a training sample s = ((x_j, y_j))_{j=1}^m = ((x_1, y_1), (x_2, y_2), ..., (x_m, y_m)). We say that h : R → {−1, 1} achieves sample margin at least γ on s if h(x_i) = y_i for each i (so that h correctly classifies the sample) and, furthermore, h is constant on each of the intervals (x_i − γ, x_i + γ).

It was then possible to obtain generalization error bounds in terms of the sample width. In this paper we use an analogous notion of width to analyse multi-category classifiers defined on a metric space.

For each k between 1 and C, let us denote by S_k^h (or simply by S_k) the sets corresponding to the function h : X → [C], defined as follows:

S_k := h^{−1}(k) = {x ∈ X : h(x) = k}.    (1)

We define the width w_h(x) of h at a point x ∈ X as follows:

w_h(x) := min_{l ≠ h(x)} d(x, S_l).

In other words, it is the distance from x to the set of points that are labeled differently from h(x). The term width is appropriate since the functional value is just the geometric distance between x and the complement of S_{h(x)}.

Given h : X → [C], for each k between 1 and C we define f_k^h : X → R by

f_k^h(x) = min_{l ≠ k} d(x, S_l) − d(x, S_k),

and we define f^h : X → R^C by setting the kth component function of f^h to be f_k^h: that is, (f^h)_k = f_k^h. Note that if h(x) = k, then f_k^h(x) ≥ 0 and f_j^h(x) ≤ 0 for j ≠ k. The function f^h contains geometrical information encoding how definitive the classification of a point is: if f_k^h(x) is a large positive number, then the point x belongs to category k and is a large distance from differently classified points. We will regard h as being in error on x ∈ X if f_k^h(x) is not positive, where k = h(x) (that is, if the classification is not unambiguously revealed by f^h). The error er_P(h) of h can then be expressed in terms of the function f^h:

er_P(h) = P( f_Y^h(X) ≤ 0 ).    (2)

We define the class F of functions as

F := { f^h : h ∈ H }.    (3)

Note that f^h is a mapping from X to the bounded set [−diam(X), diam(X)]^C ⊆ R^C.

Henceforth, we will use γ > 0 to denote a width parameter whose value is in the range (0, diam(X)]. For a positive width parameter γ > 0 and a training sample s, the empirical (sample) γ-width error is defined as

E_s^γ(h) = (1/m) Σ_{j=1}^m I( f_{y_j}^h(x_j) ≤ γ ).

(Here, I(A) is the indicator function of the set, or event, A.) We shall also denote E_s^γ(h) by E_s^γ(f), where f = f^h. Note that

f_y^h(x) ≤ γ ⟺ min_{l ≠ y} d(x, S_l) − d(x, S_y) ≤ γ ⟺ there is some l ≠ y such that d(x, S_l) ≤ d(x, S_y) + γ.

So the empirical γ-width error on the sample is the proportion of points in the sample which are either misclassified by h, or which are classified correctly but lie within distance γ of the set of points classified differently. (We recall that h(x) = y implies d(x, S_y) = 0.)

Our aim is to show that (with high probability) the generalization error er_P(h) is not much greater than E_s^γ(h). (In particular, as a special case, we want to bound the generalization error given that E_s^γ(h) = 0.) This will imply that if the learner finds a hypothesis which, for a large value of γ, has a small γ-width error, then that hypothesis is likely to have small error. What this indicates, then, is that if a hypothesis has a large width on most points of a sample, then it will be likely to have small error.
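The definitions above are straightforward to compute when X is finite. The following Python sketch (an illustration, not the paper's code; the point set, metric and classifier are hypothetical) evaluates d(x, S_k), f_k^h(x), the width w_h(x) and the empirical γ-width error E_s^γ(h) on a small finite metric space.

```python
# Illustrative sketch (hypothetical data): the sets S_k, the functions
# f_k^h(x) = min_{l != k} d(x, S_l) - d(x, S_k), the width w_h(x), and the
# empirical gamma-width error E_s^gamma(h), all on a small finite metric space.
points = [(0, 0), (0, 1), (3, 0), (3, 1), (6, 0)]            # X (finite, for illustration)

def d(u, v):                                                  # Euclidean metric on X
    return ((u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2) ** 0.5

C = 3
def h(x):                                                     # a hypothetical 3-class rule
    return 1 if x[0] < 2 else (2 if x[0] < 5 else 3)

S = {k: [x for x in points if h(x) == k] for k in range(1, C + 1)}

def dist_to_set(x, Sk):                                       # d(x, S_k)
    return min(d(x, y) for y in Sk) if Sk else float("inf")

def f(x, k):                                                  # f_k^h(x)
    other = min(dist_to_set(x, S[l]) for l in range(1, C + 1) if l != k)
    return other - dist_to_set(x, S[k])

def width(x):                                                 # w_h(x); equals f_{h(x)}^h(x) since d(x, S_{h(x)}) = 0
    return min(dist_to_set(x, S[l]) for l in range(1, C + 1) if l != h(x))

def gamma_width_error(sample, gamma):                         # E_s^gamma(h)
    return sum(1 for x, y in sample if f(x, y) <= gamma) / len(sample)

sample = [((0, 0), 1), ((3, 0), 2), ((6, 0), 3)]
print([round(width(x), 3) for x in points])                   # every point has width 3.0
print(gamma_width_error(sample, gamma=1.0))                   # 0.0: all sample widths exceed 1
```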

4 Covering numbers

4.1 Covering numbers

We will make use of some results and ideas from large-margin theory. One central idea in large-margin analysis is that of covering numbers. We will discuss different types of covering numbers, so we introduce the idea in some generality to start with. Suppose (A, d) is a (pseudo-)metric space and that α > 0. Then an α-cover of A (with respect to d) is a finite subset C of A such that, for every a ∈ A, there is some c ∈ C such that d(a, c) ≤ α. If such a cover exists, then the minimum cardinality of such a cover is the covering number N(A, α, d).

We are working with the set F of vector-valued functions from X to R^C, as defined earlier. We define the sup-metric d_∞ on F as follows: for f, g : X → R^C,

d_∞(f, g) = sup_{x ∈ X} max_{1 ≤ k ≤ C} |f_k(x) − g_k(x)|,

where f_k denotes the kth component function of f. (Note that each component function is bounded, so the metric is well-defined.)
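The bounds below are stated in terms of the covering numbers N(X, α, d) of the domain, and these need not be computed exactly: any α-cover gives an upper bound on N(X, α, d). As a rough sketch (assuming X is finite and listed explicitly, as in the earlier hypothetical example), a single greedy pass over the points already produces such a cover.

```python
# Rough sketch (assumption: X is finite and listed explicitly). A greedy
# alpha-cover of X; its size is an upper bound on the covering number
# N(X, alpha, d), since a minimal cover can only be smaller.
def greedy_cover(points, d, alpha):
    cover = []
    for x in points:
        if all(d(x, c) > alpha for c in cover):   # x is not yet covered
            cover.append(x)                       # make x a new cover point
    return cover

# Hypothetical usage with the finite metric space from the previous sketch:
# print(len(greedy_cover(points, d, alpha=1.5)))
```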

We can bound the covering numbers N(F, α, d_∞) of F (with respect to the sup-metric) in terms of the covering numbers of X with respect to its metric d. The result is as follows.

Theorem 4.1 For α ∈ (0, diam(X)],

N(F, α, d_∞) ≤ ( 9 diam(X) / α )^{C N_α},

where N_α = N(X, α/3, d).

4.2 Smoothness of the function class

As a first step towards establishing this result, we prove that the functions in F satisfy a certain Lipschitz (or smoothness) property.

Proposition 4.2 For every f ∈ F and for all x, x' ∈ X,

max_{1 ≤ k ≤ C} |f_k(x) − f_k(x')| ≤ 2 d(x, x').    (4)

Proof: Let x, x' ∈ X and fix k between 1 and C. We show that |f_k(x) − f_k(x')| ≤ 2 d(x, x'). Recall that, since f ∈ F, there is some h : X → [C] such that, for all x,

f_k(x) = min_{l ≠ k} d(x, S_l) − d(x, S_k),

where, for each i, S_i = h^{−1}(i). We have

|f_k(x) − f_k(x')| = | min_{l ≠ k} d(x, S_l) − d(x, S_k) − min_{l ≠ k} d(x', S_l) + d(x', S_k) |
≤ | min_{l ≠ k} d(x, S_l) − min_{l ≠ k} d(x', S_l) | + | d(x', S_k) − d(x, S_k) |.

We consider in turn each of the two terms in this final expression. We start with the second. Let ε > 0 and suppose r ∈ S_k is such that d(x, r) < d(x, S_k) + ε. (Such an r exists since d(x, S_k) = inf_{y ∈ S_k} d(x, y).) Let s ∈ S_k be such that d(x', s) < d(x', S_k) + ε. Then, since d(x, S_k) ≤ d(x, s),

d(x, S_k) ≤ d(x, s) ≤ d(x, x') + d(x', s) < d(x, x') + d(x', S_k) + ε.

A similar argument establishes

d(x', S_k) ≤ d(x', r) ≤ d(x, x') + d(x, r) < d(x, x') + d(x, S_k) + ε.

Combining these two inequalities shows |d(x, S_k) − d(x', S_k)| ≤ d(x, x') + ε. Since this holds for all ε > 0, it follows that |d(x, S_k) − d(x', S_k)| ≤ d(x, x').

Next we show

| min_{l ≠ k} d(x, S_l) − min_{l ≠ k} d(x', S_l) | ≤ d(x, x').

Suppose that min_{l ≠ k} d(x, S_l) = d(x, S_p) and that min_{l ≠ k} d(x', S_l) = d(x', S_q). Let ε > 0. Suppose that y ∈ S_p is such that d(x, y) < d(x, S_p) + ε, and that y' ∈ S_q is such that d(x', y') < d(x', S_q) + ε. Then

d(x', S_q) ≤ d(x', y) ≤ d(x, x') + d(x, y) < d(x, x') + d(x, S_p) + ε

and

d(x, S_p) ≤ d(x, y') ≤ d(x, x') + d(x', y') < d(x, x') + d(x', S_q) + ε.

From these inequalities it follows that |d(x, S_p) − d(x', S_q)| ≤ d(x, x') + ε for all ε > 0, and hence that |d(x, S_p) − d(x', S_q)| ≤ d(x, x'). This establishes (4).

Next, we exploit this smoothness to construct a cover for F.

4.3 Covering F

Let the subset C_α ⊆ X be a minimal-size α/3-cover for X with respect to the metric d. So, for every x ∈ X there is some x̂ ∈ C_α such that d(x, x̂) ≤ α/3. Denote by N_α the cardinality of C_α. Let

Λ_α = { λ_i = iα/3 : i = −⌊3 diam(X)/α⌋, ..., −1, 0, 1, 2, ..., ⌊3 diam(X)/α⌋ }    (5)

and define the class F̂ to be the set of all functions f̂ : C_α → (Λ_α)^C. Then F̂ is of a finite size equal to |Λ_α|^{C N_α}. For any f̂ ∈ F̂, define the extension f̂_ext : X → R^C of f̂ to the whole domain X as follows: given f̂ (which is well-defined on the points x̂_i of the cover), then for every point x in the ball B_{α/3}(x̂_i) = {x ∈ X : d(x, x̂_i) ≤ α/3}, we let f̂_ext(x) = f̂(x̂_i), for all x̂_i ∈ C_α (where, if, for a point x, there is more than one point x̂_i such that x ∈ B_{α/3}(x̂_i), we arbitrarily pick one of the points x̂_i in order to assign the value of f̂_ext(x)).

There is a one-to-one correspondence between the functions f̂ and the functions f̂_ext. Hence the set F̂_ext = { f̂_ext : f̂ ∈ F̂ } is of cardinality equal to |Λ_α|^{C N_α}.

We claim that for any f ∈ F there exists an f̂_ext such that d_∞(f, f̂_ext) ≤ α. To see this, first, for every point x̂_i ∈ C_α, consider f(x̂_i) and find a corresponding element of (Λ_α)^C, call it f̂(x̂_i), such that

max_{1 ≤ k ≤ C} |(f(x̂_i))_k − (f̂(x̂_i))_k| ≤ α/3.    (6)

(That there exists such a value follows by the design of Λ_α.) By the above definition of the extension, it follows that for all points x ∈ B_{α/3}(x̂_i) we have f̂_ext(x) = f̂(x̂_i). Now, from (4), we have, for all f ∈ F, all x̂_i ∈ C_α, all x ∈ B_{α/3}(x̂_i) and all k,

|(f(x))_k − (f(x̂_i))_k| ≤ 2 d(x, x̂_i) ≤ 2α/3.    (7)

Hence, for any f ∈ F, there exists a function f̂ ∈ F̂ with a corresponding f̂_ext ∈ F̂_ext such that, given an x ∈ X, there exists x̂_i ∈ C_α such that, for each k between 1 and C,

|(f(x))_k − (f̂_ext(x))_k| = |(f(x))_k − (f̂_ext(x̂_i))_k|.

The right-hand side can be expressed as

|(f(x))_k − (f̂_ext(x̂_i))_k| = |(f(x))_k − (f̂(x̂_i))_k|
= |(f(x))_k − (f(x̂_i))_k + (f(x̂_i))_k − (f̂(x̂_i))_k|
≤ |(f(x))_k − (f(x̂_i))_k| + |(f(x̂_i))_k − (f̂(x̂_i))_k|
≤ 2α/3 + α/3    (8)
= α,

where (8) follows from (6) and (7). Hence the set F̂_ext forms an α-cover of the class F in the sup-norm. Thus we have the following covering number bound:

N(F, α, d_∞) ≤ |Λ_α|^{C N_α} = ( 2 ⌊3 diam(X)/α⌋ + 1 )^{C N_α}.    (9)

Theorem 4.1 now follows because, for 0 < α ≤ diam(X),

2 ⌊3 diam(X)/α⌋ + 1 ≤ 6 diam(X)/α + 1 ≤ 6 diam(X)/α + 3 diam(X)/α = 9 diam(X)/α.
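The construction above is also easy to simulate. The following toy continuation of the earlier sketches (it reuses the hypothetical points, d, f, C and greedy_cover defined there; it illustrates the idea and is not the paper's construction verbatim) quantizes f^h on an α/3-cover of X, extends it by nearest cover point, and checks that the result is within α of f^h in the sup-norm.

```python
# Toy illustration of the cover of F (hypothetical data; reuses points, d, f, C
# and greedy_cover from the earlier sketches): quantize f^h on an alpha/3-cover
# of X to a grid of spacing alpha/3 and extend by nearest cover point. The
# resulting function should be within alpha of f^h in the sup-norm.
alpha = 1.5
cover = greedy_cover(points, d, alpha / 3)            # plays the role of C_alpha

def quantize(v, step):                                # nearest grid point of Lambda_alpha
    return round(v / step) * step

def f_hat_ext(x, k):
    xc = min(cover, key=lambda c: d(x, c))            # nearest cover point to x
    return quantize(f(xc, k), alpha / 3)

sup_dist = max(abs(f(x, k) - f_hat_ext(x, k))
               for x in points for k in range(1, C + 1))
assert sup_dist <= alpha                              # d_inf(f^h, f_hat_ext) <= alpha
print(round(sup_dist, 3))
```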

5 Generalization error bounds

We present two results. The first bounds the generalization error in terms of a width parameter γ for which the γ-width error on the sample is zero; the second (more general, but looser in that special case) bounds the error in terms of γ and the γ-width error on the sample (which could be non-zero).

Theorem 5.1 Suppose that X is a metric space of diameter diam(X). Suppose P is any probability measure on Z = X × [C]. Let δ ∈ (0, 1). Then, with probability at least 1 − δ, the following holds for s ∈ Z^m: for any function h : X → [C] and for any γ ∈ (0, diam(X)], if E_s^γ(h) = 0, then

er_P(h) ≤ (2/m) ( C N(X, γ/12, d) log_2( 36 diam(X)/γ ) + log_2( 4 diam(X)/(δγ) ) ).

Theorem 5.2 Suppose that X is a metric space of diameter diam(X). Suppose P is any probability measure on Z = X × [C]. Let δ ∈ (0, 1). Then, with probability at least 1 − δ, the following holds for s ∈ Z^m: for any function h : X → [C] and for any γ ∈ (0, diam(X)],

er_P(h) ≤ E_s^γ(h) + √( (2/m) ( C N(X, γ/6, d) ln( 18 diam(X)/γ ) + ln( 2 diam(X)/(γδ) ) ) ) + 1/m.

What we have in Theorem 5.2 is a high-probability bound that takes the following form: for all h and for all γ ∈ (0, diam(X)],

er_P(h) ≤ E_s^γ(h) + ε(m, γ, δ),

where ε tends to 0 as m → ∞ and ε decreases as γ increases. The rationale for seeking such a bound is that there is likely to be a trade-off between the width error on the sample and the value of ε: taking γ small, so that the error term E_s^γ(h) is zero, might entail a large value of ε; and, conversely, choosing γ large will make ε relatively small, but lead to a large sample error term. So, in principle, since the value γ is free to be chosen, one could optimize the choice of γ on the right-hand side of the bound to minimize it.
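As a rough illustration of how such an optimization over γ might look in practice, the sketch below (all numbers hypothetical; the covering-number estimates would come, for instance, from a greedy cover as above) evaluates the right-hand side of the bound of Theorem 5.2, as stated above, for a few candidate values of γ.

```python
# Rough sketch (hypothetical numbers, not from the paper): evaluate the
# right-hand side of the Theorem 5.2 bound for several candidate widths gamma,
# given an empirical width-error value and a covering-number estimate
# N(X, gamma/6, d) for each gamma.
from math import log, sqrt

def theorem_5_2_bound(width_err, cov_num, m, C, diam, delta, gamma):
    eps = sqrt((2.0 / m) * (C * cov_num * log(18 * diam / gamma)
                            + log(2 * diam / (gamma * delta)))) + 1.0 / m
    return width_err + eps

m, C, diam, delta = 10000, 3, 10.0, 0.05
for gamma, width_err, cov_num in [(0.5, 0.00, 40), (1.0, 0.02, 15), (2.0, 0.15, 6)]:
    print(gamma, round(theorem_5_2_bound(width_err, cov_num, m, C, diam, delta, gamma), 3))
# Larger gamma shrinks the covering-number term but inflates the width error,
# so an intermediate value of gamma minimizes the bound here.
```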

Proof of Theorem 5.1 The proof uses techniques similar to those first used in [15, 14, 8, 11] and in subsequent work extending those techniques to learning with real-valued functions, such as [10, 3, 1, 6]. The first observation is that if

Q = { s ∈ Z^m : there is h ∈ H with E_s^γ(h) = 0 and er_P(h) ≥ ε }

and

T = { (s, s') ∈ Z^m × Z^m : there is h ∈ H with E_s^γ(h) = 0 and E_{s'}^0(h) ≥ ε/2 },

then, for m ≥ 8/ε, P^m(Q) ≤ 2 P^{2m}(T). This is because

P^{2m}(T) ≥ P^{2m}( { (s, s') : there is h with E_s^γ(h) = 0, er_P(h) ≥ ε and E_{s'}^0(h) ≥ ε/2 } )
≥ ∫_Q P^m( { s' : there is h with E_s^γ(h) = 0, er_P(h) ≥ ε and E_{s'}^0(h) ≥ ε/2 } ) dP^m(s)
≥ (1/2) P^m(Q),

for m ≥ 8/ε. The final inequality follows from the fact that if er_P(h) ≥ ε, then for m ≥ 8/ε, P^m( E_{s'}^0(h) ≥ ε/2 ) ≥ 1/2, for any h, something that follows by a Chernoff bound.

Let G be the permutation group (the 'swapping group') on the set {1, 2, ..., 2m} generated by the transpositions (i, m + i) for i = 1, 2, ..., m. Then G acts on Z^{2m} by permuting the coordinates: for σ ∈ G,

σ(z_1, z_2, ..., z_{2m}) = (z_{σ(1)}, ..., z_{σ(2m)}).

By invariance of P^{2m} under the action of G,

P^{2m}(T) ≤ max{ Pr(σz ∈ T) : z ∈ Z^{2m} },

where Pr denotes the probability over uniform choice of σ from G. (See, for instance, [11, 2].)

Let F = {f^h : h ∈ H} be the set of vector-valued functions derived from H as before, and let F̂ be a minimal γ/2-cover of F in the d_∞-metric. Theorem 4.1 tells us that the size of F̂ is no more than (18 diam(X)/γ)^{C N}, where N = N(X, γ/6, d).

Fix z ∈ Z^{2m}. Suppose τ ∈ G is such that τz = (s, s') ∈ T, and that h is such that E_s^γ(h) = 0 and E_{s'}^0(h) ≥ ε/2. Let f̂ ∈ F̂ be such that d_∞(f̂, f^h) ≤ γ/2. Then the fact that E_s^γ(h) = 0 implies that E_s^{γ/2}(f̂) = 0, and the fact that E_{s'}^0(h) ≥ ε/2 implies that E_{s'}^{γ/2}(f̂) ≥ ε/2. To see the first claim, note that we have f_y^h(x) > γ for all (x, y) in the sample s and that, since, for all k and all x, |f̂_k(x) − f_k^h(x)| ≤ γ/2, we have also that, for all such (x, y), f̂_y(x) > γ/2. For the second claim, we observe that if f_y^h(x) ≤ 0 then f̂_y(x) ≤ γ/2.

It now follows that if τz ∈ T then, for some f̂ ∈ F̂, τz ∈ R(f̂), where

R(f̂) = { (s, s') ∈ Z^m × Z^m : E_s^{γ/2}(f̂) = 0, E_{s'}^{γ/2}(f̂) ≥ ε/2 }.

By symmetry, Pr( σz ∈ R(f̂) ) = Pr( σ(τz) ∈ R(f̂) ). Suppose that E_{s'}^{γ/2}(f̂) = r/m, where r ≥ εm/2 is the number of (x_i, y_i) in s' on which f̂ is such that f̂_{y_i}(x_i) ≤ γ/2. Then the permutations σ such that σ(τz) ∈ R(f̂) are precisely those that do not transpose any of these r coordinates, and there are 2^{m−r} ≤ 2^{m−εm/2} such σ.

It follows that, for each fixed f̂ ∈ F̂,

Pr( σz ∈ R(f̂) ) ≤ 2^{m(1 − ε/2)} / |G| = 2^{−εm/2}.

We therefore have

Pr( σz ∈ T ) ≤ Pr( σz ∈ ∪_{f̂ ∈ F̂} R(f̂) ) ≤ Σ_{f̂ ∈ F̂} Pr( σz ∈ R(f̂) ) ≤ |F̂| 2^{−εm/2}.

So,

P^m(Q) ≤ 2 P^{2m}(T) ≤ 2 |F̂| 2^{−εm/2} ≤ 2 ( 18 diam(X)/γ )^{C N} 2^{−εm/2},

where N = N(X, γ/6, d). This is at most δ when

ε = (2/m) ( C N log_2( 18 diam(X)/γ ) + log_2( 2/δ ) ).

Next, we use this to obtain a result in which γ is not prescribed in advance. For α_1, α_2 ∈ (0, diam(X)] and δ ∈ (0, 1), let E(α_1, α_2, δ) be the set of z ∈ Z^m for which there exists some h ∈ H with E_z^{α_2}(h) = 0 and er_P(h) ≥ ε_1(m, α_1, δ), where

ε_1(m, α_1, δ) = (2/m) ( C N(X, α_1/6, d) log_2( 18 diam(X)/α_1 ) + log_2( 2/δ ) ).

Then the result just obtained tells us that P^m( E(α, α, δ) ) ≤ δ. It is also clear that if α_1 ≤ α ≤ α_2 and δ_1 ≤ δ, then E(α_1, α_2, δ_1) ⊆ E(α, α, δ). It follows from a result from [6] that

P^m( ∪_{γ ∈ (0, diam(X)]} E( γ/2, γ, δγ/(2 diam(X)) ) ) ≤ δ.

In other words, with probability at least 1 − δ, for all γ ∈ (0, diam(X)], we have that if h ∈ H satisfies E_s^γ(h) = 0, then er_P(h) < ε_2(m, γ, δ), where

ε_2(m, γ, δ) = (2/m) ( C N(X, γ/12, d) log_2( 36 diam(X)/γ ) + log_2( 4 diam(X)/(δγ) ) ).

Note that γ now need not be prescribed in advance.

Proof of Theorem 5.2 Guermeur [9] has developed a framework in which to analyse multi-category classification, and we can apply one of his results to obtain the bound of Theorem 5.2, which is a generalization error bound applicable to the case in which the γ-width sample error is not zero. In that framework, there is a set G of functions from X into R^C, and a typical g ∈ G is represented by its component functions g_k for k = 1 to C. Each g ∈ G satisfies the constraint

Σ_{k=1}^C g_k(x) = 0, for all x ∈ X.

A function of this type acts as a classifier as follows: it assigns category l ∈ [C] to x ∈ X if and only if g_l(x) > max_{k ≠ l} g_k(x). (If more than one value of k maximizes g_k(x), then the classification is left undefined, assigned some value not in [C].) The risk of g ∈ G, when the underlying probability measure on X × Y is P, is defined to be

R(g) = P( { (x, y) ∈ X × [C] : g_y(x) ≤ max_{k ≠ y} g_k(x) } ).

For (v, k) ∈ R^C × [C], let

M(v, k) = (1/2) ( v_k − max_{l ≠ k} v_l )

and, for g ∈ G, let g̃ be the function X → R^C given by

g̃(x) = ( g̃_k(x) )_{k=1}^C = ( M(g(x), k) )_{k=1}^C.

Given a sample s ∈ (X × [C])^m, let

R_{γ,s}(g) = (1/m) Σ_{i=1}^m I{ g̃_{y_i}(x_i) < γ }.

A result following from [9] is (in the above notation) as follows: let δ ∈ (0, 1) and suppose P is a probability measure on Z = X × [C]. With P^m-probability at least 1 − δ, s ∈ Z^m will be such that we have the following: (for any fixed d₀ > 0) for all γ ∈ (0, d₀] and for all g ∈ G,

R(g) ≤ R_{γ,s}(g) + √( (2/m) ( ln N(G, γ/4, d_∞) + ln( 2 d₀/(γδ) ) ) ) + 1/m.

(In fact, the result from [9] involves empirical covering numbers rather than d_∞-covering numbers. The latter are at least as large as the empirical covering numbers, but we use these because we have bounded them earlier in this paper.)

For each function h : X → [C], let g = g^h be the function X → R^C defined by

g_k(x) = (1/C) Σ_{i=1}^C d(x, S_i) − d(x, S_k),

where, as before, S_j = h^{−1}(j). Let G = {g^h : h ∈ H} be the set of all such g. Then these functions satisfy the constraint that their coordinate functions sum to the zero function, since

Σ_{k=1}^C g_k(x) = Σ_{k=1}^C ( (1/C) Σ_{i=1}^C d(x, S_i) ) − Σ_{k=1}^C d(x, S_k) = Σ_{i=1}^C d(x, S_i) − Σ_{k=1}^C d(x, S_k) = 0.

For each k,

g̃_k(x) = M(g(x), k) = (1/2) ( g_k(x) − max_{l ≠ k} g_l(x) )
= (1/2) ( (1/C) Σ_{i=1}^C d(x, S_i) − d(x, S_k) − max_{l ≠ k} ( (1/C) Σ_{i=1}^C d(x, S_i) − d(x, S_l) ) )
= (1/2) ( − d(x, S_k) − max_{l ≠ k} ( − d(x, S_l) ) )
= (1/2) ( min_{l ≠ k} d(x, S_l) − d(x, S_k) )
= (1/2) f_k^h(x).

From the definition of g, the event that g_y(x) ≤ max_{k ≠ y} g_k(x) is equivalent to the event that

(1/C) Σ_{i=1}^C d(x, S_i) − d(x, S_y) ≤ max_{k ≠ y} ( (1/C) Σ_{i=1}^C d(x, S_i) − d(x, S_k) ),

which is equivalent to min_{k ≠ y} d(x, S_k) ≤ d(x, S_y). It can therefore be seen that R(g) = er_P(h). Similarly, R_{γ,s}(g) = E_s^γ(h). Noting that G = (1/2)F, so that an α/2-cover of F will provide an α/4-cover of G, we can therefore apply Guermeur's result (with d₀ = diam(X)) to see that, with probability at least 1 − δ, for all h and for all γ ∈ (0, diam(X)],

er_P(h) ≤ E_s^γ(h) + √( (2/m) ( ln N(F, γ/2, d_∞) + ln( 2 diam(X)/(γδ) ) ) ) + 1/m
≤ E_s^γ(h) + √( (2/m) ( C N(X, γ/6, d) ln( 18 diam(X)/γ ) + ln( 2 diam(X)/(γδ) ) ) ) + 1/m.
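As a quick numerical sanity check of this reduction (a toy illustration reusing the hypothetical finite metric space, S, C, dist_to_set and f from the earlier sketches, not part of the paper), one can verify that the coordinates of g^h sum to zero and that M(g(x), k) = (1/2) f_k^h(x) at every point.

```python
# Sanity check on the toy space from the earlier sketches (hypothetical data):
# the coordinate functions g_k(x) = (1/C) * sum_i d(x, S_i) - d(x, S_k) sum to
# zero, and Guermeur's margin operator satisfies M(g(x), k) = (1/2) * f_k^h(x).
def g(x, k):
    avg = sum(dist_to_set(x, S[i]) for i in range(1, C + 1)) / C
    return avg - dist_to_set(x, S[k])

def M(v, k):                                   # v is the list (g_1(x), ..., g_C(x))
    return 0.5 * (v[k - 1] - max(v[l - 1] for l in range(1, C + 1) if l != k))

for x in points:
    v = [g(x, k) for k in range(1, C + 1)]
    assert abs(sum(v)) < 1e-9                  # coordinates sum to zero
    for k in range(1, C + 1):
        assert abs(M(v, k) - 0.5 * f(x, k)) < 1e-9
print("reduction verified on the toy space")
```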

Acknowledgements

This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST.

References

[1] M. Anthony and P. L. Bartlett. Function learning from interpolation. Combinatorics, Probability, and Computing, 9.

[2] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press.

[3] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. J. ACM, 44(4):615-631.

[4] M. Anthony and J. Ratsaby. Maximal width learning of binary functions. Theoretical Computer Science, 411.

[5] M. Anthony and J. Ratsaby. Learning on finite metric spaces. RUTCOR Research Report RRR, June.

[6] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2).

[7] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. ACM, 36(4).

[8] R. M. Dudley. Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics 63, Cambridge University Press, Cambridge, UK, 1999.

[9] Yann Guermeur. VC theory of large margin multi-category classifiers. Journal of Machine Learning Research, 8.

[10] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100.

[11] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, 1984.

[12] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1996.

[13] A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans. Advances in Large-Margin Classifiers (Neural Information Processing). MIT Press.

[14] V. N. Vapnik. Statistical Learning Theory. Wiley.

[15] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2), 1971.
