Sample width for multi-category classifiers
Rutcor Research Report

Sample width for multi-category classifiers

Martin Anthony (Department of Mathematics, The London School of Economics and Political Science, Houghton Street, London WC2A 2AE, UK; m.anthony@lse.ac.uk)
Joel Ratsaby (Electrical and Electronics Engineering Department, Ariel University Center of Samaria, Ariel 40700, Israel; ratsaby@ariel.ac.il)

RRR, November 2012. RUTCOR, Rutgers Center for Operations Research, Rutgers University, 640 Bartholomew Road, Piscataway, New Jersey; rrr@rutcor.rutgers.edu
Abstract. In a recent paper, the authors introduced the notion of sample width for binary classifiers defined on the set of real numbers. It was shown that the performance of such classifiers could be quantified in terms of this sample width. This paper considers how to adapt the idea of sample width so that it can be applied in cases where the classifiers are multi-category and are defined on some arbitrary metric space.

Acknowledgements: This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence.
1 Introduction

By a (multi-category) classifier on a set $X$, we mean a function mapping from $X$ to $[C] = \{1, 2, \dots, C\}$, where $C \ge 2$ is the number of possible categories. Such a classifier indicates to which of the $C$ different classes objects from $X$ belong and, in supervised machine learning, it is arrived at on the basis of a sample: a set of objects from $X$ together with their classifications in $[C]$. In [4], the notion of sample width for binary classifiers ($C = 2$) mapping from the real line $X = \mathbb{R}$ was introduced, and in [5] this was generalized to finite metric spaces. In this paper, we consider how a similar approach might be taken to the situation in which $C$ could be larger than 2, and in which the classifiers map not simply from the real line, but from some metric space (which would not generally have the linear structure of the real line). The definition of sample width is given below, but it is possible to indicate the basic idea at this stage: we define the sample width to be at least $\gamma$ if the classifier achieves the correct classifications on the sample and, furthermore, for each sample point, the minimum distance to a point of the domain having a different classification is at least $\gamma$.

A key issue that arises in machine learning is that of generalization error: given that a classifier has been produced by some learning algorithm on the basis of a (random) sample of a certain size, how can we quantify the accuracy of that classifier, where by its accuracy we mean its likely performance in classifying objects from $X$ correctly? In this paper, we seek answers to this question that involve not just the sample size, but the sample width.

2 Probabilistic modelling of learning

We work in a version of the popular PAC framework of computational learning theory (see [14, 7]).
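The basic idea can be made concrete with a short sketch. Assuming a finite domain, a classifier achieves sample width at least $\gamma$ exactly when it is correct on the sample and no differently classified domain point lies within distance $\gamma$ of a sample point. The classifier, metric, and points below are invented for illustration; `sample_width_at_least` is not a function from the paper.

```python
# Check whether h achieves sample width at least gamma on a finite domain:
# h must classify every sample point correctly, and each sample point must
# lie at distance >= gamma from every domain point labeled differently.

def sample_width_at_least(h, d, domain, sample, gamma):
    for x, y in sample:
        if h(x) != y:
            return False  # misclassified sample point
        if any(d(x, z) < gamma for z in domain if h(z) != h(x)):
            return False  # a differently labeled point is too close
    return True

# Usage: a toy 3-class classifier on a finite subset of the real line.
h = lambda x: 1 if x < 2 else (2 if x < 6 else 3)
d = lambda x, y: abs(x - y)
domain = [0.0, 1.0, 3.0, 5.0, 7.0, 9.0]
sample = [(0.0, 1), (3.0, 2), (9.0, 3)]
print(sample_width_at_least(h, d, domain, sample, 2.0))  # True
print(sample_width_at_least(h, d, domain, sample, 2.5))  # False: 1.0 is only 2 away from 3.0
```

Note that the check runs over the whole domain, not just the sample: width measures distance to any differently classified point of $X$.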
This model assumes that the sample $s$ consists of an ordered set $(x_i, y_i)$ of labeled examples, where $x_i \in X$ and $y_i \in Y = [C]$, and that each $(x_i, y_i)$ in the training sample $s$ has been generated randomly according to some fixed (but unknown) probability distribution $P$ on $Z = X \times Y$. (This includes, as a special case, the situation in which each $x_i$ is drawn according to a fixed distribution on $X$ and is then labeled deterministically by $y_i = t(x_i)$, where $t$ is some fixed function.) Thus, a sample $s$ of length $m$ can be thought of as being drawn randomly according to the product probability distribution $P^m$. An appropriate measure of how well $h : X \to Y$ would perform on further randomly drawn points is its error, $\mathrm{er}_P(h)$, the probability that $h(X) \ne Y$ for random $(X, Y)$. Given any function $h \in H$, we can measure how well $h$ matches the training sample through its sample error
$$\mathrm{er}_s(h) = \frac{1}{m}\,\bigl|\{i : h(x_i) \ne y_i\}\bigr|$$
(the proportion of points in the sample incorrectly classified by $h$). Much classical work in learning theory (see [7, 14], for instance) related the error of a classifier $h$ to its sample error. A typical result would state that, for all $\delta \in (0, 1)$, with probability at least $1 - \delta$, for all $h$ belonging to some specified set of functions, we have $\mathrm{er}_P(h) < \mathrm{er}_s(h) + \epsilon(m, \delta)$, where $\epsilon(m, \delta)$ (known as a generalization error bound) is decreasing in $m$ and $\delta$. Such results can be derived using uniform convergence theorems from probability theory [15, 11, 8], in which case $\epsilon(m, \delta)$ would typically involve a quantity known as the growth function of the set of classifiers [15, 7, 14, 2]. More recently, emphasis has been placed on learning with a large margin. (See, for instance, [13, 2, 1, 12].) The rationale behind margin-based generalization error bounds in the two-category classification case is that if a binary classifier can be thought of as a geometrical separator between points, and if it has managed to achieve a wide separation between the points of different classification, then this indicates that it is a good classifier, and it is possible that a better generalization error bound can be obtained. Margin-based results apply when the binary classifiers are derived from real-valued functions by thresholding (taking their sign). Margin analysis has been extended to multi-category classifiers in [9].

3 The width of a classifier

We now discuss the case where the underlying set of objects $X$ forms a metric space. Let $X$ be a set on which is defined a metric $d : X \times X \to \mathbb{R}$. For a subset $S$ of $X$, define the distance from $x \in X$ to $S$ as
$$d(x, S) := \inf_{y \in S} d(x, y).$$
We define the diameter of $X$ to be
$$\mathrm{diam}(X) := \sup_{x, y \in X} d(x, y).$$
We will denote by $H$ the set of all possible functions $h$ from $X$ to $[C]$. The paper [4] introduced the notion of the width of a binary classifier at a point in the domain, in the case where the domain was the real line $\mathbb{R}$.
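On a finite metric space the infimum and supremum in these two definitions become a minimum and maximum, so both quantities are directly computable. A minimal sketch; the metric and point set are illustrative choices, not from the paper:

```python
# Distance from a point to a set, and the diameter, in a finite metric
# space (X, d). For finite X and S the inf/sup are attained as min/max.

def dist_to_set(d, x, S):
    """d(x, S) = inf_{y in S} d(x, y)."""
    return min(d(x, y) for y in S)

def diameter(d, X):
    """diam(X) = sup_{x, y in X} d(x, y)."""
    return max(d(x, y) for x in X for y in X)

# Usage with the usual metric on a finite subset of R.
d = lambda x, y: abs(x - y)
X = [0.0, 1.0, 4.0]
print(dist_to_set(d, 0.0, [1.0, 4.0]))  # 1.0
print(diameter(d, X))                   # 4.0
```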
Consider a set of points $\{x_1, x_2, \dots, x_m\}$ from $\mathbb{R}$ which, together with their true classifications $y_i \in \{-1, 1\}$, yield a training sample $s = ((x_j, y_j))_{j=1}^m = ((x_1, y_1), (x_2, y_2), \dots, (x_m, y_m))$. We say that $h : \mathbb{R} \to \{-1, 1\}$ achieves sample margin at least $\gamma$ on $s$ if $h(x_i) = y_i$ for each $i$ (so that $h$ correctly classifies the sample) and, furthermore, $h$ is constant on each of
the intervals $(x_i - \gamma, x_i + \gamma)$. It was then possible to obtain generalization error bounds in terms of the sample width. In this paper we use an analogous notion of width to analyse multi-category classifiers defined on a metric space.

For each $k$ between 1 and $C$, let us denote by $S^h_k$ (or simply by $S_k$) the sets corresponding to the function $h : X \to [C]$, defined as follows:
$$S_k := h^{-1}(k) = \{x \in X : h(x) = k\}. \quad (1)$$
We define the width $w_h(x)$ of $h$ at a point $x \in X$ as follows:
$$w_h(x) := \min_{l \ne h(x)} d(x, S_l).$$
In other words, it is the distance from $x$ to the set of points that are labeled differently from $h(x)$. The term "width" is appropriate since the functional value is just the geometric distance between $x$ and the complement of $S_{h(x)}$.

Given $h : X \to [C]$, for each $k$ between 1 and $C$, we define $f^h_k : X \to \mathbb{R}$ by
$$f^h_k(x) = \min_{l \ne k} d(x, S_l) - d(x, S_k),$$
and we define $f^h : X \to \mathbb{R}^C$ by setting the $k$th component function of $f^h$ to be $f^h_k$: that is, $(f^h)_k = f^h_k$. Note that if $h(x) = k$, then $f^h_k(x) \ge 0$ and $f^h_j(x) \le 0$ for $j \ne k$. The function $f^h$ contains geometrical information encoding how definitive the classification of a point is: if $f^h_k(x)$ is a large positive number, then the point $x$ belongs to category $k$ and is a large distance from differently classified points. We will regard $h$ as being in error on $x \in X$ if $f^h_k(x)$ is not positive, where $k = h(x)$ (that is, if the classification is not unambiguously revealed by $f^h$). The error $\mathrm{er}_P(h)$ of $h$ can then be expressed in terms of the function $f^h$:
$$\mathrm{er}_P(h) = P\left(f^h_Y(X) \le 0\right). \quad (2)$$
We define the class $F$ of functions as
$$F := \{f^h : h \in H\}. \quad (3)$$
Note that $f^h$ is a mapping from $X$ to the bounded set $[-\mathrm{diam}(X), \mathrm{diam}(X)]^C \subseteq \mathbb{R}^C$. Henceforth, we will use $\gamma > 0$ to denote a width parameter whose value is in the range $(0, \mathrm{diam}(X)]$. For a positive width parameter $\gamma > 0$ and a training sample $s$, the empirical (sample) $\gamma$-width error is defined as
$$E^\gamma_s(h) = \frac{1}{m} \sum_{j=1}^m I\left(f^h_{y_j}(x_j) \le \gamma\right).$$
(Here, $I(A)$ is the indicator function of the set, or event, $A$.) We shall also denote $E^\gamma_s(h)$ by $E^\gamma_s(f)$ where $f = f^h$. Note that
$$f^h_y(x) \le \gamma \iff \min_{l \ne y} d(x, S_l) - d(x, S_y) \le \gamma \iff \exists\, l \ne y \text{ such that } d(x, S_l) \le d(x, S_y) + \gamma.$$
So the empirical $\gamma$-width error on the sample is the proportion of points in the sample which are either misclassified by $h$, or which are classified correctly but lie within distance $\gamma$ of the set of points classified differently. (We recall that $h(x) = y$ implies $d(x, S_y) = 0$.)

Our aim is to show that (with high probability) the generalization error $\mathrm{er}_P(h)$ is not much greater than $E^\gamma_s(h)$. (In particular, as a special case, we want to bound the generalization error given that $E^\gamma_s(h) = 0$.) This will imply that if the learner finds a hypothesis which, for a large value of $\gamma$, has a small $\gamma$-width error, then that hypothesis is likely to have small error. What this indicates, then, is that if a hypothesis has a large width on most points of a sample, then it will be likely to have small error.

4 Covering numbers

4.1 Covering numbers

We will make use of some results and ideas from large-margin theory. One central idea in large-margin analysis is that of covering numbers. We will discuss different types of covering numbers, so we introduce the idea in some generality to start with. Suppose $(A, d)$ is a (pseudo-)metric space and that $\alpha > 0$. Then an $\alpha$-cover of $A$ (with respect to $d$) is a finite subset $C$ of $A$ such that, for every $a \in A$, there is some $c \in C$ such that $d(a, c) \le \alpha$. If such a cover exists, then the minimum cardinality of such a cover is the covering number $N(A, \alpha, d)$.

We are working with the set $F$ of vector-valued functions from $X$ to $\mathbb{R}^C$, as defined earlier. We define the sup-metric $d_\infty$ on $F$ as follows: for $f, g : X \to \mathbb{R}^C$,
$$d_\infty(f, g) = \sup_{x \in X} \max_{1 \le k \le C} |f_k(x) - g_k(x)|,$$
where $f_k$ denotes the $k$th component function of $f$. (Note that each component function is bounded, so the metric is well-defined.)
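A simple way to exhibit an $\alpha$-cover of a finite (pseudo-)metric space, and hence an upper bound on $N(A, \alpha, d)$, is the greedy construction sketched below. The greedy cover need not be minimal, so its size only bounds the covering number from above; all concrete values are illustrative.

```python
# Greedy construction of an alpha-cover of a finite (pseudo-)metric
# space A: scan the points, adding each one as a new centre only if it
# is not already within alpha of an existing centre.

def greedy_cover(d, A, alpha):
    """Return a subset of A such that every a in A is within alpha of it."""
    cover = []
    for a in A:
        if all(d(a, c) > alpha for c in cover):
            cover.append(a)  # a is not yet covered: make it a centre
    return cover

# Usage: covering {0, 1, ..., 9} with radius 1 in the usual metric.
d = lambda x, y: abs(x - y)
A = list(range(10))
C = greedy_cover(d, A, 1)
print(C)  # [0, 2, 4, 6, 8]: so N(A, 1, d) is at most 5
```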
We can bound the covering numbers $N(F, \alpha, d_\infty)$ of $F$ (with respect to the sup-metric) in terms of the covering numbers of $X$ with respect to its metric $d$. The result is as follows.
Theorem 4.1 For $\alpha \in (0, \mathrm{diam}(X)]$,
$$N(F, \alpha, d_\infty) \le \left(\frac{9\,\mathrm{diam}(X)}{\alpha}\right)^{C N_\alpha},$$
where $N_\alpha = N(X, \alpha/3, d)$.

4.2 Smoothness of the function class

As a first step towards establishing this result, we prove that the functions in $F$ satisfy a certain Lipschitz (or smoothness) property.

Proposition 4.2 For every $f \in F$, and for all $x, x' \in X$,
$$\max_{1 \le k \le C} |f_k(x) - f_k(x')| \le 2\, d(x, x'). \quad (4)$$

Proof: Let $x, x' \in X$ and fix $k$ between 1 and $C$. We show that $|f_k(x) - f_k(x')| \le 2\, d(x, x')$. Recall that, since $f \in F$, there is some $h : X \to [C]$ such that, for all $x$,
$$f_k(x) = \min_{l \ne k} d(x, S_l) - d(x, S_k),$$
where, for each $i$, $S_i = h^{-1}(i)$. We have
$$|f_k(x) - f_k(x')| \le \left|\min_{l \ne k} d(x, S_l) - \min_{l \ne k} d(x', S_l)\right| + \left|d(x, S_k) - d(x', S_k)\right|.$$
We consider in turn each of the two terms in this final expression. We start with the second. Let $\epsilon > 0$ and suppose $r \in S_k$ is such that $d(x, r) < d(x, S_k) + \epsilon$. (Such an $r$ exists since $d(x, S_k) = \inf_{y \in S_k} d(x, y)$.) Let $s \in S_k$ be such that $d(x', s) < d(x', S_k) + \epsilon$. Then, by the triangle inequality,
$$d(x, S_k) \le d(x, s) \le d(x, x') + d(x', s) < d(x, x') + d(x', S_k) + \epsilon.$$
A similar argument establishes
$$d(x', S_k) \le d(x', r) \le d(x, x') + d(x, r) < d(x, x') + d(x, S_k) + \epsilon.$$
Combining these two inequalities shows that $|d(x, S_k) - d(x', S_k)| \le d(x, x') + \epsilon$. Since this holds for all $\epsilon > 0$, it follows that $|d(x, S_k) - d(x', S_k)| \le d(x, x')$.

Next we show that
$$\left|\min_{l \ne k} d(x, S_l) - \min_{l \ne k} d(x', S_l)\right| \le d(x, x').$$
Suppose that $\min_{l \ne k} d(x, S_l) = d(x, S_p)$ and that $\min_{l \ne k} d(x', S_l) = d(x', S_q)$. Let $\epsilon > 0$. Suppose that $y \in S_p$ is such that $d(x, y) < d(x, S_p) + \epsilon$, and that $y' \in S_q$ is such that $d(x', y') < d(x', S_q) + \epsilon$. Then
$$d(x', S_q) \le d(x', S_p) \le d(x', y) \le d(x, x') + d(x, y) < d(x, x') + d(x, S_p) + \epsilon$$
and
$$d(x, S_p) \le d(x, S_q) \le d(x, y') \le d(x, x') + d(x', y') < d(x, x') + d(x', S_q) + \epsilon.$$
From these inequalities it follows that $|d(x, S_p) - d(x', S_q)| \le d(x, x') + \epsilon$ for all $\epsilon > 0$, and hence that $|d(x, S_p) - d(x', S_q)| \le d(x, x')$. ∎

Next, we exploit this smoothness to construct a cover for $F$.

4.3 Covering F

Let the subset $C_\alpha \subseteq X$ be a minimal-size $\alpha/3$-cover for $X$ with respect to the metric $d$. So, for every $x \in X$ there is some $\hat{x} \in C_\alpha$ such that $d(x, \hat{x}) \le \alpha/3$. Denote by $N_\alpha$ the cardinality of $C_\alpha$. Let
$$\Lambda_\alpha = \left\{\lambda_i = i\alpha : i = -\left\lfloor \frac{3\,\mathrm{diam}(X)}{\alpha} \right\rfloor, \dots, -1, 0, 1, \dots, \left\lfloor \frac{3\,\mathrm{diam}(X)}{\alpha} \right\rfloor\right\} \quad (5)$$
and define the class $\hat{F}$ to be all functions $\hat{f} : C_\alpha \to (\Lambda_\alpha)^C$. Then $\hat{F}$ is of finite size equal to $|\Lambda_\alpha|^{C N_\alpha}$. For any $\hat{f} \in \hat{F}$, define the extension $\hat{f}_{\mathrm{ext}} : X \to \mathbb{R}^C$ of $\hat{f}$ to the whole domain $X$ as follows: given $\hat{f}$ (which is well-defined on the points $\hat{x}_i$ of the cover), then for every point $x$ in the ball $B_{\alpha/3}(\hat{x}_i) = \{x \in X : d(x, \hat{x}_i) \le \alpha/3\}$, we let $\hat{f}_{\mathrm{ext}}(x) = \hat{f}(\hat{x}_i)$, for all $\hat{x}_i \in C_\alpha$ (where, if for a point $x$ there is more than one point $\hat{x}_i$ such that $x \in B_{\alpha/3}(\hat{x}_i)$, we arbitrarily pick one of the points $\hat{x}_i$ in order to assign the value of $\hat{f}_{\mathrm{ext}}(x)$). There is a one-to-one correspondence
between the functions $\hat{f}$ and the functions $\hat{f}_{\mathrm{ext}}$. Hence the set $\hat{F}_{\mathrm{ext}} = \{\hat{f}_{\mathrm{ext}} : \hat{f} \in \hat{F}\}$ is of cardinality equal to $|\Lambda_\alpha|^{C N_\alpha}$.

We claim that for any $f \in F$ there exists an $\hat{f}_{\mathrm{ext}}$ such that $d_\infty(f, \hat{f}_{\mathrm{ext}}) \le \alpha$. To see this, first, for every point $\hat{x}_i \in C_\alpha$, consider $f(\hat{x}_i)$ and find a corresponding element of $(\Lambda_\alpha)^C$, call it $\hat{f}(\hat{x}_i)$, such that
$$\max_{1 \le k \le C} \left|(f(\hat{x}_i))_k - (\hat{f}(\hat{x}_i))_k\right| \le \alpha/3. \quad (6)$$
(That there exists such a value follows by design of $\Lambda_\alpha$.) By the above definition of extension, it follows that for all points $x \in B_{\alpha/3}(\hat{x}_i)$ we have $\hat{f}_{\mathrm{ext}}(x) = \hat{f}(\hat{x}_i)$. Now, from (4), we have for all $f \in F$,
$$\max_{1 \le k \le C}\, \sup_{x \in B_{\alpha/3}(\hat{x}_i)} \left|(f(x))_k - (f(\hat{x}_i))_k\right| \le 2\, d(x, \hat{x}_i) \le 2\alpha/3. \quad (7)$$
Hence for any $f \in F$ there exists a function $\hat{f} \in \hat{F}$ with a corresponding $\hat{f}_{\mathrm{ext}} \in \hat{F}_{\mathrm{ext}}$ such that, given an $x \in X$, there exists $\hat{x}_i \in C_\alpha$ such that, for each $k$ between 1 and $C$,
$$\left|(f(x))_k - (\hat{f}_{\mathrm{ext}}(x))_k\right| = \left|(f(x))_k - (\hat{f}_{\mathrm{ext}}(\hat{x}_i))_k\right|.$$
The right-hand side can be expressed as
$$\left|(f(x))_k - (\hat{f}(\hat{x}_i))_k\right| \le \left|(f(x))_k - (f(\hat{x}_i))_k\right| + \left|(f(\hat{x}_i))_k - (\hat{f}(\hat{x}_i))_k\right| \le 2\alpha/3 + \alpha/3 = \alpha, \quad (8)$$
where (8) follows from (6) and (7). Hence the set $\hat{F}_{\mathrm{ext}}$ forms an $\alpha$-cover of the class $F$ in the sup-norm. Thus we have the following covering number bound:
$$N(F, \alpha, d_\infty) \le |\Lambda_\alpha|^{C N_\alpha} = \left(2\left\lfloor \frac{3\,\mathrm{diam}(X)}{\alpha} \right\rfloor + 1\right)^{C N_\alpha}. \quad (9)$$
Theorem 4.1 now follows because, for $0 < \alpha \le \mathrm{diam}(X)$,
$$2\left\lfloor \frac{3\,\mathrm{diam}(X)}{\alpha} \right\rfloor + 1 \le \frac{6\,\mathrm{diam}(X)}{\alpha} + \frac{3\,\mathrm{diam}(X)}{\alpha} = \frac{9\,\mathrm{diam}(X)}{\alpha}.$$

5 Generalization error bounds

We present two results. The first bounds the generalization error in terms of a width parameter $\gamma$ for which the $\gamma$-width error on the sample is zero; the second is more general (but looser
in that special case) and bounds the error in terms of $\gamma$ and the $\gamma$-width error on the sample (which could be non-zero).

Theorem 5.1 Suppose that $X$ is a metric space of diameter $\mathrm{diam}(X)$. Suppose $P$ is any probability measure on $Z = X \times [C]$. Let $\delta \in (0, 1)$. Then, with probability at least $1 - \delta$, the following holds for $s \in Z^m$: for any function $h : X \to [C]$ and for any $\gamma \in (0, \mathrm{diam}(X)]$, if $E^\gamma_s(h) = 0$, then
$$\mathrm{er}_P(h) \le \frac{2}{m}\left(C\, N(X, \gamma/12, d)\, \log_2\left(\frac{36\,\mathrm{diam}(X)}{\gamma}\right) + \log_2\left(\frac{4\,\mathrm{diam}(X)}{\delta\gamma}\right)\right).$$

Theorem 5.2 Suppose that $X$ is a metric space of diameter $\mathrm{diam}(X)$. Suppose $P$ is any probability measure on $Z = X \times [C]$. Let $\delta \in (0, 1)$. Then, with probability at least $1 - \delta$, the following holds for $s \in Z^m$: for any function $h : X \to [C]$ and for any $\gamma \in (0, \mathrm{diam}(X)]$,
$$\mathrm{er}_P(h) \le E^\gamma_s(h) + \sqrt{\frac{2}{m}\left(C\, N(X, \gamma/6, d)\, \ln\left(\frac{18\,\mathrm{diam}(X)}{\gamma}\right) + \ln\left(\frac{2\,\mathrm{diam}(X)}{\gamma\delta}\right)\right)} + \frac{1}{m}.$$

What we have in Theorem 5.2 is a high-probability bound that takes the following form: for all $h$ and for all $\gamma \in (0, \mathrm{diam}(X)]$,
$$\mathrm{er}_P(h) \le E^\gamma_s(h) + \epsilon(m, \gamma, \delta),$$
where $\epsilon$ tends to 0 as $m \to \infty$ and $\epsilon$ decreases as $\gamma$ increases. The rationale for seeking such a bound is that there is likely to be a trade-off between width error on the sample and the value of $\epsilon$: taking $\gamma$ small so that the error term $E^\gamma_s(h)$ is zero might entail a large value of $\epsilon$; and, conversely, choosing $\gamma$ large will make $\epsilon$ relatively small, but lead to a large sample error term. So, in principle, since the value $\gamma$ is free to be chosen, one could optimize the choice of $\gamma$ on the right-hand side of the bound to minimize it.

Proof of Theorem 5.1

The proof uses techniques similar to those first used in [15, 14, 8, 11] and in subsequent work extending those techniques to learning with real-valued functions, such as [10, 3, 1, 6]. The first observation is that if
$$Q = \{s \in Z^m : \exists h \in H \text{ with } E^\gamma_s(h) = 0,\ \mathrm{er}_P(h) \ge \epsilon\}$$
and
$$T = \{(s, s') \in Z^m \times Z^m : \exists h \in H \text{ with } E^\gamma_s(h) = 0,\ E^0_{s'}(h) \ge \epsilon/2\},$$
then, for $m \ge 8/\epsilon$, $P^m(Q) \le 2 P^{2m}(T)$. This is because
$$P^{2m}(T) \ge P^{2m}\left(\left\{(s, s') : \exists h,\ E^\gamma_s(h) = 0,\ \mathrm{er}_P(h) \ge \epsilon \text{ and } E^0_{s'}(h) \ge \epsilon/2\right\}\right) = \int_Q P^m\left(\left\{s' : \exists h,\ E^\gamma_s(h) = 0,\ \mathrm{er}_P(h) \ge \epsilon \text{ and } E^0_{s'}(h) \ge \epsilon/2\right\}\right) dP^m(s) \ge \frac{1}{2}\, P^m(Q),$$
for $m \ge 8/\epsilon$. The final inequality follows from the fact that if $\mathrm{er}_P(h) \ge \epsilon$, then for $m \ge 8/\epsilon$, $P^m(E^0_{s'}(h) \ge \epsilon/2) \ge 1/2$ for any $h$, something that follows by a Chernoff bound.

Let $G$ be the permutation group (the "swapping group") on the set $\{1, 2, \dots, 2m\}$ generated by the transpositions $(i, m+i)$ for $i = 1, 2, \dots, m$. Then $G$ acts on $Z^{2m}$ by permuting the coordinates: for $\sigma \in G$,
$$\sigma(z_1, z_2, \dots, z_{2m}) = (z_{\sigma(1)}, \dots, z_{\sigma(2m)}).$$
By invariance of $P^{2m}$ under the action of $G$,
$$P^{2m}(T) \le \max\{\Pr(\sigma z \in T) : z \in Z^{2m}\},$$
where $\Pr$ denotes the probability over uniform choice of $\sigma$ from $G$. (See, for instance, [11, 2].)

Let $F = \{f^h : h \in H\}$ be the set of vector-valued functions derived from $H$ as before, and let $\hat{F}$ be a minimal $\gamma/2$-cover of $F$ in the $d_\infty$-metric. Theorem 4.1 tells us that the size of $\hat{F}$ is no more than
$$\left(\frac{18\,\mathrm{diam}(X)}{\gamma}\right)^{C N}, \quad \text{where } N = N(X, \gamma/6, d).$$
Fix $z \in Z^{2m}$. Suppose $\tau z = (s, s') \in T$ and that $h$ is such that $E^\gamma_s(h) = 0$ and $E^0_{s'}(h) \ge \epsilon/2$. Let $\hat{f} \in \hat{F}$ be such that $d_\infty(\hat{f}, f^h) \le \gamma/2$. Then the fact that $E^\gamma_s(h) = 0$ implies that $E^{\gamma/2}_s(\hat{f}) = 0$; and the fact that $E^0_{s'}(h) \ge \epsilon/2$ implies that $E^{\gamma/2}_{s'}(\hat{f}) \ge \epsilon/2$. To see the first claim, note that we have $f^h_y(x) > \gamma$ for all $(x, y)$ in the sample $s$ and that, since, for all $k$ and all $x$, $|\hat{f}_k(x) - f^h_k(x)| \le \gamma/2$, we have also that for all such $(x, y)$, $\hat{f}_y(x) > \gamma/2$. For the second claim, we observe that if $f^h_y(x) \le 0$ then $\hat{f}_y(x) \le \gamma/2$.

It now follows that if $\tau z \in T$, then, for some $\hat{f} \in \hat{F}$, $\tau z \in R(\hat{f})$, where
$$R(\hat{f}) = \{(s, s') \in Z^m \times Z^m : E^{\gamma/2}_s(\hat{f}) = 0,\ E^{\gamma/2}_{s'}(\hat{f}) \ge \epsilon/2\}.$$
By symmetry, $\Pr(\sigma z \in R(\hat{f})) = \Pr(\sigma(\tau z) \in R(\hat{f}))$. Suppose that $E^{\gamma/2}_{s'}(\hat{f}) = r/m$, where $r \ge \epsilon m/2$ is the number of $(x_i, y_i)$ in $s'$ on which $\hat{f}$ is such that $\hat{f}_{y_i}(x_i) \le \gamma/2$. Then those
permutations $\sigma$ such that $\sigma(\tau z) \in R(\hat{f})$ are precisely those that do not transpose these $r$ coordinates, and there are $2^{m-r} \le 2^{m - \epsilon m/2}$ such $\sigma$. It follows that, for each fixed $\hat{f} \in \hat{F}$,
$$\Pr\left(\sigma z \in R(\hat{f})\right) \le \frac{2^{m(1 - \epsilon/2)}}{|G|} = 2^{-\epsilon m/2}.$$
We therefore have
$$\Pr(\sigma z \in T) \le \Pr\left(\sigma z \in \bigcup_{\hat{f} \in \hat{F}} R(\hat{f})\right) \le \sum_{\hat{f} \in \hat{F}} \Pr\left(\sigma z \in R(\hat{f})\right) \le |\hat{F}|\, 2^{-\epsilon m/2}.$$
So,
$$P^m(Q) \le 2\, P^{2m}(T) \le 2\, |\hat{F}|\, 2^{-\epsilon m/2} \le 2 \left(\frac{18\,\mathrm{diam}(X)}{\gamma}\right)^{C N} 2^{-\epsilon m/2},$$
where $N = N(X, \gamma/6, d)$. This is at most $\delta$ when
$$\epsilon = \frac{2}{m}\left(C N \log_2\left(\frac{18\,\mathrm{diam}(X)}{\gamma}\right) + \log_2\left(\frac{2}{\delta}\right)\right).$$
Next, we use this to obtain a result in which $\gamma$ is not prescribed in advance. For $\alpha_1, \alpha_2, \delta \in (0, 1)$, let $E(\alpha_1, \alpha_2, \delta)$ be the set of $z \in Z^m$ for which there exists some $h \in H$ with $E^{\alpha_2}_z(h) = 0$ and $\mathrm{er}_P(h) \ge \epsilon_1(m, \alpha_1, \delta)$, where
$$\epsilon_1(m, \alpha_1, \delta) = \frac{2}{m}\left(C\, N(X, \alpha_1/6, d)\, \log_2\left(\frac{18\,\mathrm{diam}(X)}{\alpha_1}\right) + \log_2\left(\frac{2}{\delta}\right)\right).$$
Then the result just obtained tells us that $P^m(E(\alpha, \alpha, \delta)) \le \delta$. It is also clear that if $\alpha_1 \le \alpha \le \alpha_2$ and $\delta_1 \le \delta$, then $E(\alpha_1, \alpha_2, \delta_1) \subseteq E(\alpha, \alpha, \delta)$. It follows, from a result from [6], that
$$P^m\left(\bigcup_{\gamma \in (0,\, \mathrm{diam}(X)]} E\left(\gamma/2,\ \gamma,\ \frac{\delta\gamma}{2\,\mathrm{diam}(X)}\right)\right) \le \delta.$$
In other words, with probability at least $1 - \delta$, for all $\gamma \in (0, \mathrm{diam}(X)]$, we have that if $h \in H$ satisfies $E^\gamma_s(h) = 0$, then $\mathrm{er}_P(h) < \epsilon_2(m, \gamma, \delta)$, where
$$\epsilon_2(m, \gamma, \delta) = \frac{2}{m}\left(C\, N(X, \gamma/12, d)\, \log_2\left(\frac{36\,\mathrm{diam}(X)}{\gamma}\right) + \log_2\left(\frac{4\,\mathrm{diam}(X)}{\delta\gamma}\right)\right).$$
Note that $\gamma$ now need not be prescribed in advance. ∎
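The bound $\epsilon_2(m, \gamma, \delta)$ just derived is easy to evaluate numerically. In the sketch below, `N_cov` stands in for the covering number $N(X, \gamma/12, d)$, which would have to be computed or bounded for the particular metric space at hand; all parameter values are illustrative choices, not from the paper.

```python
import math

# Evaluate the Theorem 5.1 error bound
#   eps_2(m, gamma, delta) = (2/m) * ( C * N(X, gamma/12, d) * log2(36 diam / gamma)
#                                      + log2(4 diam / (delta * gamma)) )
# for illustrative parameter values.

def eps2(m, gamma, delta, C, diam, N_cov):
    return (2 / m) * (C * N_cov * math.log2(36 * diam / gamma)
                      + math.log2(4 * diam / (delta * gamma)))

for m in (10**3, 10**4, 10**5):
    print(m, eps2(m, gamma=0.1, delta=0.05, C=3, diam=1.0, N_cov=50))
# The bound decays like 1/m, and grows as gamma shrinks (both directly
# and, in practice, through the covering number N(X, gamma/12, d)).
```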
Proof of Theorem 5.2

Guermeur [9] has developed a framework in which to analyse multi-category classification, and we can apply one of his results to obtain the bound of Theorem 5.2, which is a generalization error bound applicable to the case in which the $\gamma$-width sample error is not zero. In that framework, there is a set $G$ of functions from $X$ into $\mathbb{R}^C$, and a typical $g \in G$ is represented by its component functions $g_k$ for $k = 1$ to $C$. Each $g \in G$ satisfies the constraint
$$\sum_{k=1}^C g_k(x) = 0, \quad \text{for all } x \in X.$$
A function of this type acts as a classifier as follows: it assigns category $l \in [C]$ to $x \in X$ if and only if $g_l(x) > \max_{k \ne l} g_k(x)$. (If more than one value of $k$ maximizes $g_k(x)$, then the classification is left undefined, assigned some value not in $[C]$.) The risk of $g \in G$, when the underlying probability measure on $X \times Y$ is $P$, is defined to be
$$R(g) = P\left(\left\{(x, y) \in X \times [C] : g_y(x) \le \max_{k \ne y} g_k(x)\right\}\right).$$
For $(v, k) \in \mathbb{R}^C \times [C]$, let
$$M(v, k) = \frac{1}{2}\left(v_k - \max_{l \ne k} v_l\right)$$
and, for $g \in G$, let $\tilde{g}$ be the function $X \to \mathbb{R}^C$ given by
$$\tilde{g}(x) = (\tilde{g}_k(x))_{k=1}^C = (M(g(x), k))_{k=1}^C.$$
Given a sample $s \in (X \times [C])^m$, let
$$R_{\gamma, s}(g) = \frac{1}{m} \sum_{i=1}^m I\{\tilde{g}_{y_i}(x_i) < \gamma\}.$$
A result following from [9] is (in the above notation) as follows: Let $\delta \in (0, 1)$ and suppose $P$ is a probability measure on $Z = X \times [C]$. With $P^m$-probability at least $1 - \delta$, $s \in Z^m$ will be such that we have the following: (for any fixed $\bar{d} > 0$) for all $\gamma \in (0, \bar{d}]$ and for all $g \in G$,
$$R(g) \le R_{\gamma, s}(g) + \sqrt{\frac{2}{m}\left(\ln N(G, \gamma/4, d_\infty) + \ln\left(\frac{2\bar{d}}{\gamma\delta}\right)\right)} + \frac{1}{m}.$$
(In fact, the result from [9] involves empirical covering numbers rather than $d_\infty$-covering numbers. The latter are at least as large as the empirical covering numbers, but we use
these because we have bounded them earlier in this paper.)

For each function $h : X \to [C]$, let $g = g^h$ be the function $X \to \mathbb{R}^C$ defined by
$$g_k(x) = \frac{1}{C} \sum_{i=1}^C d(x, S_i) - d(x, S_k),$$
where, as before, $S_j = h^{-1}(j)$. Let $G = \{g^h : h \in H\}$ be the set of all such $g$. Then these functions satisfy the constraint that their coordinate functions sum to the zero function, since
$$\sum_{k=1}^C g_k(x) = \sum_{k=1}^C \frac{1}{C} \sum_{i=1}^C d(x, S_i) - \sum_{k=1}^C d(x, S_k) = \sum_{i=1}^C d(x, S_i) - \sum_{k=1}^C d(x, S_k) = 0.$$
For each $k$,
$$\tilde{g}_k(x) = M(g(x), k) = \frac{1}{2}\left(g_k(x) - \max_{l \ne k} g_l(x)\right) = \frac{1}{2}\left(\frac{1}{C}\sum_{i=1}^C d(x, S_i) - d(x, S_k) - \max_{l \ne k}\left(\frac{1}{C}\sum_{i=1}^C d(x, S_i) - d(x, S_l)\right)\right) = \frac{1}{2}\left(\min_{l \ne k} d(x, S_l) - d(x, S_k)\right) = \frac{1}{2}\, f^h_k(x).$$
From the definition of $g$, the event that $g_y(x) \le \max_{k \ne y} g_k(x)$ is equivalent to the event that
$$\frac{1}{C}\sum_{i=1}^C d(x, S_i) - d(x, S_y) \le \max_{k \ne y}\left(\frac{1}{C}\sum_{i=1}^C d(x, S_i) - d(x, S_k)\right),$$
which is equivalent to $\min_{k \ne y} d(x, S_k) \le d(x, S_y)$. It can therefore be seen that $R(g) = \mathrm{er}_P(h)$. Similarly, $R_{\gamma, s}(g) = E^\gamma_s(h)$. Noting that $G = (1/2)F$, so that an $\alpha/2$-cover of $F$ will provide an $\alpha/4$-cover of $G$, we can therefore apply Guermeur's result to see that, with probability at least $1 - \delta$, for all $h$ and for all $\gamma \in (0, \mathrm{diam}(X)]$,
$$\mathrm{er}_P(h) \le E^\gamma_s(h) + \sqrt{\frac{2}{m}\left(\ln N(F, \gamma/2, d_\infty) + \ln\left(\frac{2\,\mathrm{diam}(X)}{\gamma\delta}\right)\right)} + \frac{1}{m} \le E^\gamma_s(h) + \sqrt{\frac{2}{m}\left(C\, N(X, \gamma/6, d)\, \ln\left(\frac{18\,\mathrm{diam}(X)}{\gamma}\right) + \ln\left(\frac{2\,\mathrm{diam}(X)}{\gamma\delta}\right)\right)} + \frac{1}{m}. \quad ∎$$
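The two identities above, that the $g_k$ sum to zero and that $M(g(x), k) = f^h_k(x)/2$, can be spot-checked numerically. The class sets below are finite subsets of the line chosen purely for illustration:

```python
# Build the Guermeur-style functions g_k from a classifier via
#   g_k(x) = (1/C) * sum_i d(x, S_i) - d(x, S_k),
# then check that sum_k g_k(x) = 0 and that M(g(x), k) = f_k^h(x) / 2.

classes = {1: [0.0], 2: [4.0], 3: [10.0]}  # S_k = h^{-1}(k), finite here
C = len(classes)
dist = lambda x, S: min(abs(x - y) for y in S)

def g(x):
    total = sum(dist(x, S) for S in classes.values())
    return {k: total / C - dist(x, S) for k, S in classes.items()}

def f(x, k):
    """f_k^h(x) = min_{l != k} d(x, S_l) - d(x, S_k)."""
    return min(dist(x, S) for l, S in classes.items() if l != k) - dist(x, classes[k])

x = 3.0
gx = g(x)
print(abs(sum(gx.values())) < 1e-12)  # True: the g_k sum to zero
M = {k: 0.5 * (gx[k] - max(v for l, v in gx.items() if l != k)) for k in gx}
print(all(abs(M[k] - f(x, k) / 2) < 1e-12 for k in classes))  # True
```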
Acknowledgements

This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence.

References

[1] M. Anthony and P. L. Bartlett. Function learning from interpolation. Combinatorics, Probability and Computing, 9, 2000.
[2] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[3] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615–631, 1997.
[4] M. Anthony and J. Ratsaby. Maximal width learning of binary functions. Theoretical Computer Science, 411, 2010.
[5] M. Anthony and J. Ratsaby. Learning on finite metric spaces. RUTCOR Research Report RRR, June.
[6] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2), 1998.
[7] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4), 1989.
[8] R. M. Dudley. Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics 63, Cambridge University Press, Cambridge, UK, 1999.
[9] Y. Guermeur. VC theory of large margin multi-category classifiers. Journal of Machine Learning Research, 8, 2007.
[10] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100, 1992.
[11] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, 1984.
[12] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1998.
[13] A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans. Advances in Large-Margin Classifiers (Neural Information Processing). MIT Press, 2000.
[14] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[15] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
More informationPart of the slides are adapted from Ziko Kolter
Part of the slides are adapted from Ziko Kolter OUTLINE 1 Supervised learning: classification........................................................ 2 2 Non-linear regression/classification, overfitting,
More informationBennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence
Bennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence Chao Zhang The Biodesign Institute Arizona State University Tempe, AZ 8587, USA Abstract In this paper, we present
More informationA Large Deviation Bound for the Area Under the ROC Curve
A Large Deviation Bound for the Area Under the ROC Curve Shivani Agarwal, Thore Graepel, Ralf Herbrich and Dan Roth Dept. of Computer Science University of Illinois Urbana, IL 680, USA {sagarwal,danr}@cs.uiuc.edu
More informationA Pseudo-Boolean Set Covering Machine
A Pseudo-Boolean Set Covering Machine Pascal Germain, Sébastien Giguère, Jean-Francis Roy, Brice Zirakiza, François Laviolette, and Claude-Guy Quimper Département d informatique et de génie logiciel, Université
More informationR u t c o r Research R e p o r t. A New Imputation Method for Incomplete Binary Data. Mine Subasi a Martin Anthony c. Ersoy Subasi b P.L.
R u t c o r Research R e p o r t A New Imputation Method for Incomplete Binary Data Mine Subasi a Martin Anthony c Ersoy Subasi b P.L. Hammer d RRR 15-2009, August 2009 RUTCOR Rutgers Center for Operations
More information10.1 The Formal Model
67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 10: The Formal (PAC) Learning Model Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 We have see so far algorithms that explicitly estimate
More informationIFT Lecture 7 Elements of statistical learning theory
IFT 6085 - Lecture 7 Elements of statistical learning theory This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s): Brady Neal and
More informationCSE 648: Advanced algorithms
CSE 648: Advanced algorithms Topic: VC-Dimension & ɛ-nets April 03, 2003 Lecturer: Piyush Kumar Scribed by: Gen Ohkawa Please Help us improve this draft. If you read these notes, please send feedback.
More informationLearning with multiple models. Boosting.
CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models
More informationE0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011)
E0 370 Statistical Learning Theory Lecture 5 Aug 5, 0 Covering Nubers, Pseudo-Diension, and Fat-Shattering Diension Lecturer: Shivani Agarwal Scribe: Shivani Agarwal Introduction So far we have seen how
More informationA Bound on the Label Complexity of Agnostic Active Learning
A Bound on the Label Complexity of Agnostic Active Learning Steve Hanneke March 2007 CMU-ML-07-103 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Machine Learning Department,
More informationOnline Learning: Random Averages, Combinatorial Parameters, and Learnability
Online Learning: Random Averages, Combinatorial Parameters, and Learnability Alexander Rakhlin Department of Statistics University of Pennsylvania Karthik Sridharan Toyota Technological Institute at Chicago
More informationComputational Learning Theory
Computational Learning Theory Pardis Noorzad Department of Computer Engineering and IT Amirkabir University of Technology Ordibehesht 1390 Introduction For the analysis of data structures and algorithms
More informationComputational Learning Theory. CS534 - Machine Learning
Computational Learning Theory CS534 Machine Learning Introduction Computational learning theory Provides a theoretical analysis of learning Shows when a learning algorithm can be expected to succeed Shows
More informationComputational Learning Theory. Definitions
Computational Learning Theory Computational learning theory is interested in theoretical analyses of the following issues. What is needed to learn effectively? Sample complexity. How many examples? Computational
More informationCOMPUTATIONAL LEARNING THEORY
COMPUTATIONAL LEARNING THEORY XIAOXU LI Abstract. This paper starts with basic concepts of computational learning theory and moves on with examples in monomials and learning rays to the final discussion
More informationSupport Vector Machines
Support Vector Machines Tobias Pohlen Selected Topics in Human Language Technology and Pattern Recognition February 10, 2014 Human Language Technology and Pattern Recognition Lehrstuhl für Informatik 6
More informationCS 6375: Machine Learning Computational Learning Theory
CS 6375: Machine Learning Computational Learning Theory Vibhav Gogate The University of Texas at Dallas Many slides borrowed from Ray Mooney 1 Learning Theory Theoretical characterizations of Difficulty
More informationCOMS 4771 Introduction to Machine Learning. Nakul Verma
COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric
More informationRademacher Averages and Phase Transitions in Glivenko Cantelli Classes
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 1, JANUARY 2002 251 Rademacher Averages Phase Transitions in Glivenko Cantelli Classes Shahar Mendelson Abstract We introduce a new parameter which
More informationFast Rates for Estimation Error and Oracle Inequalities for Model Selection
Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@cs.berkeley.edu
More informationComputational Learning Theory
Computational Learning Theory Sinh Hoa Nguyen, Hung Son Nguyen Polish-Japanese Institute of Information Technology Institute of Mathematics, Warsaw University February 14, 2006 inh Hoa Nguyen, Hung Son
More informationEmpirical Risk Minimization
Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space
More informationSolving Classification Problems By Knowledge Sets
Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose
More informationComputational Learning Theory
09s1: COMP9417 Machine Learning and Data Mining Computational Learning Theory May 20, 2009 Acknowledgement: Material derived from slides for the book Machine Learning, Tom M. Mitchell, McGraw-Hill, 1997
More informationLecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;
CSCI699: Topics in Learning and Game Theory Lecture 2 Lecturer: Ilias Diakonikolas Scribes: Li Han Today we will cover the following 2 topics: 1. Learning infinite hypothesis class via VC-dimension and
More informationBaum s Algorithm Learns Intersections of Halfspaces with respect to Log-Concave Distributions
Baum s Algorithm Learns Intersections of Halfspaces with respect to Log-Concave Distributions Adam R Klivans UT-Austin klivans@csutexasedu Philip M Long Google plong@googlecom April 10, 2009 Alex K Tang
More informationSupport Vector Machine Classification via Parameterless Robust Linear Programming
Support Vector Machine Classification via Parameterless Robust Linear Programming O. L. Mangasarian Abstract We show that the problem of minimizing the sum of arbitrary-norm real distances to misclassified
More informationYale University Department of Computer Science. The VC Dimension of k-fold Union
Yale University Department of Computer Science The VC Dimension of k-fold Union David Eisenstat Dana Angluin YALEU/DCS/TR-1360 June 2006, revised October 2006 The VC Dimension of k-fold Union David Eisenstat
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem
More informationModels of Language Acquisition: Part II
Models of Language Acquisition: Part II Matilde Marcolli CS101: Mathematical and Computational Linguistics Winter 2015 Probably Approximately Correct Model of Language Learning General setting of Statistical
More informationOn A-distance and Relative A-distance
1 ADAPTIVE COMMUNICATIONS AND SIGNAL PROCESSING LABORATORY CORNELL UNIVERSITY, ITHACA, NY 14853 On A-distance and Relative A-distance Ting He and Lang Tong Technical Report No. ACSP-TR-08-04-0 August 004
More informationSynthesis of Maximum Margin and Multiview Learning using Unlabeled Data
Synthesis of Maximum Margin and Multiview Learning using Unlabeled Data Sandor Szedma 1 and John Shawe-Taylor 1 1 - Electronics and Computer Science, ISIS Group University of Southampton, SO17 1BJ, United
More informationMACHINE LEARNING. Vapnik-Chervonenkis (VC) Dimension. Alessandro Moschitti
MACHINE LEARNING Vapnik-Chervonenkis (VC) Dimension Alessandro Moschitti Department of Information Engineering and Computer Science University of Trento Email: moschitti@disi.unitn.it Computational Learning
More information1 Differential Privacy and Statistical Query Learning
10-806 Foundations of Machine Learning and Data Science Lecturer: Maria-Florina Balcan Lecture 5: December 07, 015 1 Differential Privacy and Statistical Query Learning 1.1 Differential Privacy Suppose
More informationEfficient classification for metric data
Efficient classification for metric data Lee-Ad Gottlieb Weizmann Institute of Science lee-ad.gottlieb@weizmann.ac.il Aryeh Leonid Kontorovich Ben Gurion University karyeh@cs.bgu.ac.il Robert Krauthgamer
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem
More informationGeneralization bounds
Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question
More informationUnderstanding Generalization Error: Bounds and Decompositions
CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the
More informationComputational Learning Theory
Computational Learning Theory [read Chapter 7] [Suggested exercises: 7.1, 7.2, 7.5, 7.8] Computational learning theory Setting 1: learner poses queries to teacher Setting 2: teacher chooses examples Setting
More information1 Active Learning Foundations of Machine Learning and Data Science. Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015
10-806 Foundations of Machine Learning and Data Science Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015 1 Active Learning Most classic machine learning methods and the formal learning
More informationData-Dependent Structural Risk. Decision Trees. John Shawe-Taylor. Royal Holloway, University of London 1. Nello Cristianini. University of Bristol 2
Data-Dependent Structural Risk Minimisation for Perceptron Decision Trees John Shawe-Taylor Royal Holloway, University of London 1 Email: jst@dcs.rhbnc.ac.uk Nello Cristianini University of Bristol 2 Email:
More informationComputational Learning Theory: Probably Approximately Correct (PAC) Learning. Machine Learning. Spring The slides are mainly from Vivek Srikumar
Computational Learning Theory: Probably Approximately Correct (PAC) Learning Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 This lecture: Computational Learning Theory The Theory
More information12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016
12. Structural Risk Minimization ECE 830 & CS 761, Spring 2016 1 / 23 General setup for statistical learning theory We observe training examples {x i, y i } n i=1 x i = features X y i = labels / responses
More informationMACHINE LEARNING. Support Vector Machines. Alessandro Moschitti
MACHINE LEARNING Support Vector Machines Alessandro Moschitti Department of information and communication technology University of Trento Email: moschitti@dit.unitn.it Summary Support Vector Machines
More informationComputational Learning Theory
0. Computational Learning Theory Based on Machine Learning, T. Mitchell, McGRAW Hill, 1997, ch. 7 Acknowledgement: The present slides are an adaptation of slides drawn by T. Mitchell 1. Main Questions
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationVC Dimension Review. The purpose of this document is to review VC dimension and PAC learning for infinite hypothesis spaces.
VC Dimension Review The purpose of this document is to review VC dimension and PAC learning for infinite hypothesis spaces. Previously, in discussing PAC learning, we were trying to answer questions about
More informationA Necessary Condition for Learning from Positive Examples
Machine Learning, 5, 101-113 (1990) 1990 Kluwer Academic Publishers. Manufactured in The Netherlands. A Necessary Condition for Learning from Positive Examples HAIM SHVAYTSER* (HAIM%SARNOFF@PRINCETON.EDU)
More informationHence C has VC-dimension d iff. π C (d) = 2 d. (4) C has VC-density l if there exists K R such that, for all
1. Computational Learning Abstract. I discuss the basics of computational learning theory, including concept classes, VC-dimension, and VC-density. Next, I touch on the VC-Theorem, ɛ-nets, and the (p,
More informationIntroduction to Machine Learning
Introduction to Machine Learning PAC Learning and VC Dimension Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE
More informationActive Learning and Optimized Information Gathering
Active Learning and Optimized Information Gathering Lecture 7 Learning Theory CS 101.2 Andreas Krause Announcements Project proposal: Due tomorrow 1/27 Homework 1: Due Thursday 1/29 Any time is ok. Office
More informationOn Efficient Agnostic Learning of Linear Combinations of Basis Functions
On Efficient Agnostic Learning of Linear Combinations of Basis Functions Wee Sun Lee Dept. of Systems Engineering, RSISE, Aust. National University, Canberra, ACT 0200, Australia. WeeSun.Lee@anu.edu.au
More informationAdaptive Sampling Under Low Noise Conditions 1
Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università
More informationStatistical learning theory, Support vector machines, and Bioinformatics
1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.
More informationGeneralization, Overfitting, and Model Selection
Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification Maria-Florina (Nina) Balcan 10/03/2016 Two Core Aspects of Machine Learning Algorithm Design. How
More informationIntroduction to Statistical Learning Theory
Introduction to Statistical Learning Theory Definition Reminder: We are given m samples {(x i, y i )} m i=1 Dm and a hypothesis space H and we wish to return h H minimizing L D (h) = E[l(h(x), y)]. Problem
More informationLearning with Decision Lists of Data-Dependent Features
Journal of Machine Learning Research 6 (2005) 427 451 Submitted 9/04; Published 4/05 Learning with Decision Lists of Data-Dependent Features Mario Marchand Département IFT-GLO Université Laval Québec,
More informationThe Set Covering Machine with Data-Dependent Half-Spaces
The Set Covering Machine with Data-Dependent Half-Spaces Mario Marchand Département d Informatique, Université Laval, Québec, Canada, G1K-7P4 MARIO.MARCHAND@IFT.ULAVAL.CA Mohak Shah School of Information
More informationLecture 29: Computational Learning Theory
CS 710: Complexity Theory 5/4/2010 Lecture 29: Computational Learning Theory Instructor: Dieter van Melkebeek Scribe: Dmitri Svetlov and Jake Rosin Today we will provide a brief introduction to computational
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationFormulation with slack variables
Formulation with slack variables Optimal margin classifier with slack variables and kernel functions described by Support Vector Machine (SVM). min (w,ξ) ½ w 2 + γσξ(i) subject to ξ(i) 0 i, d(i) (w T x(i)
More informationBounding the Fat Shattering Dimension of a Composition Function Class Built Using a Continuous Logic Connective
Bounding the Fat Shattering Dimension of a Composition Function Class Built Using a Continuous Logic Connective Hubert Haoyang Duan University of Ottawa hduan065@uottawa.ca June 23, 2012 Abstract: The
More informationAn introduction to Support Vector Machines
1 An introduction to Support Vector Machines Giorgio Valentini DSI - Dipartimento di Scienze dell Informazione Università degli Studi di Milano e-mail: valenti@dsi.unimi.it 2 Outline Linear classifiers
More informationA Note on Extending Generalization Bounds for Binary Large-Margin Classifiers to Multiple Classes
A Note on Extending Generalization Bounds for Binary Large-Margin Classifiers to Multiple Classes Ürün Dogan 1 Tobias Glasmachers 2 and Christian Igel 3 1 Institut für Mathematik Universität Potsdam Germany
More informationMulticlass Learnability and the ERM principle
Multiclass Learnability and the ERM principle Amit Daniely Sivan Sabato Shai Ben-David Shai Shalev-Shwartz November 5, 04 arxiv:308.893v [cs.lg] 4 Nov 04 Abstract We study the sample complexity of multiclass
More informationPAC-Bayes Risk Bounds for Sample-Compressed Gibbs Classifiers
PAC-Bayes Ris Bounds for Sample-Compressed Gibbs Classifiers François Laviolette Francois.Laviolette@ift.ulaval.ca Mario Marchand Mario.Marchand@ift.ulaval.ca Département d informatique et de génie logiciel,
More information