Large width nearest prototype classification on general distance spaces


Martin Anthony (Department of Mathematics, London School of Economics and Political Science, Houghton Street, London WC2A 2AE, U.K.) and Joel Ratsaby (Department of Electrical and Electronics Engineering, Ariel University of Samaria, Ariel 40700, Israel)

Abstract

In this paper we consider the problem of learning nearest-prototype classifiers in any finite distance space, that is, in any finite set equipped with a distance function. An important advantage of a distance space over a metric space is that the triangle inequality need not be satisfied, which makes our results potentially very useful in practice. We consider a family of binary classifiers for learning nearest-prototype classification on distance spaces, building on the concept of large-width learning which we introduced and studied in earlier works. Nearest-prototype classification is a more general version of the ubiquitous nearest-neighbor rule: a prototype may or may not be a sample point. One advantage of the approach taken in this paper is that the error bounds depend on a width parameter, which can be sample-dependent and thereby yield a tighter bound.

1 Introduction

Learning Vector Quantization (LVQ) and its various extensions, introduced by Kohonen [21], are used successfully in many machine learning tools and applications. Learning pattern classification by LVQ is based on adapting a fixed set of labeled prototypes in Euclidean space and using the resulting set of prototypes in a nearest-prototype (winner-take-all) rule to classify any point in the input space. As [20] mentions, LVQ fails if a Euclidean representation is not well suited to the data, and there have been extensions of LVQ that allow different metrics [20, 24] and that take advantage of samples for which a more confident (or large-margin) classification can be obtained. Generalization error bounds that depend on this sample margin are stated in [20, 24] for learning over Euclidean spaces and, as is usually the case for large-margin learning [1], these error bounds are tighter than ones with no sample-margin dependence. Such results are important because they explain why LVQ works well in practice in Euclidean metric spaces.

There are learning domains in which it is difficult to formalize quantitative features that can be encoded as numerical variables which together constitute a vector space (usually Euclidean) in which to discriminate between objects belonging to different classes [22]. Learning over such domains requires a qualitative approach which tries to describe the structure of objects in a way that is similar to how humans do, for instance in terms of morphological elements of objects. Objects are then represented not by numerical vectors but by other means, such as strings of symbols, which can be compared using a dissimilarity (or distance) function. This approach is much more flexible than one based on numerical features, since there are many existing distance functions [18] and new ones can easily be defined for any kind of object, for instance bioinformatic sequences, graphs or images, and they do not have to satisfy the requirements of a metric. However, most learning algorithms, in particular neural networks, which have been very successful recently, require a Euclidean or, more generally, a vector space represented by numerical features. Such problem domains are potential applications of prototype-based learning over non-Euclidean or, more generally, non-metric spaces.

In [3] we studied learning binary classification with nearest-prototype classifiers over metric spaces and obtained sample-dependent error bounds. In the current paper we consider learning binary classification on finite distance spaces, that is, finite sets equipped with a distance function (often called a dissimilarity measure [18]), where the classifiers (which we call nearest-prototype classifiers) are generalizations of the well-known nearest-neighbor classifier [14]. An important advantage of a distance space over a metric space is that the triangle inequality need not be satisfied, which makes our results potentially very useful in practice. Our definition of a distance function is quite loose, in that it does not need to satisfy any of the non-negativity, symmetry or reflexivity properties of a proper distance function [18]. We still call it a distance because, as far as we can expect in applying our learning results, any useful space has at least the non-negativity property, and so we will assume throughout the paper that the distance function is non-negative.

We consider a family of binary classifiers for learning nearest-prototype classification on distance spaces, building on the concept of large-width learning introduced in [4] and expanded to various classification settings in [5-10]. The advantage of this approach is that the error bounds depend on the width parameter, which can be sample-dependent and thereby yield a tighter error bound. We define a width function which measures the difference between the distance from a test point $x$ (to be classified) to its nearest negative prototype and the distance to its nearest positive prototype. The classifier's decision is the sign of this difference. The set of prototypes from which these two nearest ones are obtained is very general, in that it can be any set of points in the distance space. In particular, it can be a subset of the sample and can be determined via any algorithm. The error bounds that we state in the current paper apply regardless of the algorithm that is used to determine these prototypes. The fact that we deal with a distance space, rather than a metric space, means that the triangle inequality need not be satisfied.
The error bound depends on quantities that can be evaluated directly from a matrix which encodes the half-space functions over the distance space. This matrix is a function of the distance matrix of the space and hence can be computed efficiently, for instance using massively parallel processing techniques as in [11, 12]. Also, as mentioned above, our use of the concept of distance is loose: not only does the triangle inequality not need to be satisfied, but none of the three standard properties of a distance needs to be satisfied either.

2 Setup

2.1 Nearest-prototype classifiers

For a positive integer $n$, let $[n] := \{1, 2, \ldots, n\}$. We consider a finite set $\mathcal{X} := \{x_1, \ldots, x_N\}$ with a binary set $\mathcal{Y} = \{-1, 1\}$ of possible classifications. Let $d$, a distance function, be a function from $\mathcal{X} \times \mathcal{X}$ to $\mathbb{R}$. We assume that $d$ is normalized such that
$$\mathrm{diam}(\mathcal{X}) := \max_{1 \le i, j \le N} d(x_i, x_j) = 1.$$
A prototype $p_i \in \mathcal{X}$ is a point in the distance space that has an associated label, denoted by $\tau_i \in \mathcal{Y}$. We write $p_i^+$ and $p_j^-$ for a prototype with label $\tau_i = 1$ and a prototype with label $\tau_j = -1$, respectively. When the label of a prototype is not explicitly mentioned we write $p_i$.

Given a fixed integer $n \ge 2$, we consider learning the family of classifiers defined by the nearest-neighbor rule on a set of $n$ prototypes. We refer to any such classifier by $h_{R,\tau}$; it is defined by an ordered set of prototypes $R = \{p_i\}_{i=1}^n$ and their corresponding label vector $\tau := [\tau_1, \ldots, \tau_n]$, $\tau_i \in \mathcal{Y}$, $1 \le i \le n$. The order of the set $R$ can be defined based on any ordering of $\mathcal{X}$, and it allows us to fix $\tau$ in advance and consider all classifiers obtained by all ordered sets $R$. We define $h_{R,\tau}$ to be a nearest-prototype classifier as follows: let
$$N^+ = N^+(\tau) := \{i \in [n] : \tau_i = 1\}, \qquad N^- = N^-(\tau) := \{i \in [n] : \tau_i = -1\}.$$
Then, given $x$,
$$h_{R,\tau}(x) = \begin{cases} -1 & \text{if } \arg\min_{1 \le i \le n} d(x, p_i) \in N^- \\ 1 & \text{otherwise.} \end{cases} \qquad (2.1)$$
Let us denote a sample of $m$ labeled examples by
$$\{(X_i, Y_i)\}_{i=1}^m. \qquad (2.2)$$
In the current paper, a prototype $p_i$ may be any point in $\mathcal{X}$; in particular, one that depends on the sample directly or via some learning algorithm. For instance, if the labeled prototype set $(R, \tau)$ is precisely the labeled sample, then $h_{R,\tau}$ is the well-known nearest-neighbor classifier [14]. If the labeled prototype set is only a subset of the sample, then $h_{R,\tau}$ belongs to the family of so-called edited nearest-neighbor classifiers [17]. The prototype set $(R, \tau)$ may not necessarily be a subset of the sample, but could be derived from it via some adaptive procedure such as the LVQ algorithm [21], in which case $h_{R,\tau}$ would be the LVQ classifier. In the current paper, any such classifier is referred to as a nearest-prototype classifier and, as mentioned above, is denoted by $h_{R,\tau}$.
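As a concrete illustration, the following minimal sketch implements the nearest-prototype rule (2.1) for a finite distance space given by its distance matrix. The function names and the NumPy representation are our own illustrative choices, not notation from the paper.

```python
import numpy as np

def nearest_prototype_classify(D, proto_idx, proto_labels, x_idx):
    """Nearest-prototype rule (2.1) on a finite distance space.

    D            -- N x N matrix with D[i, j] = d(x_i, x_j), normalized so the maximum is 1
    proto_idx    -- indices (into X) of the n prototypes p_1, ..., p_n
    proto_labels -- their labels tau_i in {-1, +1}
    x_idx        -- index of the point x to classify
    Returns -1 if the nearest prototype is negatively labeled, +1 otherwise.
    """
    dists = D[x_idx, proto_idx]        # d(x, p_i) for i = 1, ..., n
    winner = int(np.argmin(dists))     # index of the nearest prototype
    return -1 if proto_labels[winner] == -1 else 1

# Tiny example: N = 4 points with a (possibly non-metric) distance matrix.
D = np.array([[0.0, 0.3, 0.9, 1.0],
              [0.3, 0.0, 0.6, 0.8],
              [0.9, 0.6, 0.0, 0.2],
              [1.0, 0.8, 0.2, 0.0]])
R = np.array([0, 3])                   # prototypes p_1 = x_1, p_2 = x_4
tau = np.array([1, -1])                # their labels
print(nearest_prototype_classify(D, R, tau, x_idx=1))   # nearest is x_1, so +1
```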

2.2 Probabilistic modeling of learning

We work in the framework of the popular PAC model of computational learning theory (see [13, 27]). This model assumes that the labeled examples $(X_i, Y_i)$ in the training sample have been generated randomly according to some fixed (but unknown) probability distribution $P$ on $Z = \mathcal{X} \times \mathcal{Y}$. (This includes, as a special case, the situation in which each $X_i$ is drawn according to a fixed distribution on $\mathcal{X}$ and is then labeled deterministically by $Y_i = t(X_i)$, where $t$ is some fixed function.) Thus, a sample (2.2) of length $m$ can be regarded as being drawn randomly according to the product probability distribution $P^m$. In general, suppose that $H$ is a set of functions from $\mathcal{X}$ to $\{-1, 1\}$. An appropriate measure of how well $h \in H$ would perform on further randomly drawn points is its error, $\mathrm{er}_P(h)$, the probability that $h(X) \ne Y$ for random $(X, Y)$, which can be expressed as
$$\mathrm{er}_P(h) = P(h(X) \ne Y) = P(Y h(X) < 0). \qquad (2.3)$$
Given any function $h \in H$, we can measure how well $h$ matches the training sample through its sample error
$$\widehat{\mathrm{er}}(h) = \frac{1}{m}\,|\{i : h(X_i) \ne Y_i\}|$$
(the proportion of points in the sample incorrectly classified by $h$). Much classical work in learning theory (see [13, 27], for instance) related the error of a classifier $h$ to its sample error. A typical result would state that, for all $\delta \in (0, 1)$, with probability at least $1 - \delta$, for all $h \in H$ we have $\mathrm{er}_P(h) < \widehat{\mathrm{er}}(h) + \epsilon(m, \delta)$, where $\epsilon(m, \delta)$ (known as a generalization error bound) is decreasing in $m$ and $\delta$. Such results can be derived using uniform convergence theorems from probability theory [19, 23, 28], in which case $\epsilon(m, \delta)$ would typically involve a quantity known as the growth function of the set of classifiers [1, 13, 27, 28]. More recently, emphasis has been placed on learning with a large margin (see, for instance, [1, 2, 25, 26]). The rationale behind margin-based, or width-based, generalization error bounds is that if a classifier has managed to achieve a wide separation between the points of different classification, then this indicates that it is a good classifier, and it is possible that a better generalization error bound can be obtained. Margin-based results apply when the classifiers are derived from real-valued functions by thresholding (taking their sign). A more direct approach, which does not require real-valued functions as a basis for a classification margin, uses the concept of width, introduced in [4] and studied in various settings in [5-10].

2.3 Width and error of a classifier

We define the width of $h_{R,\tau}$ at a point $x \in \mathcal{X}$ as follows:
$$w_{h_{R,\tau}}(x) := \min_{1 \le i \le n:\ \tau_i \ne h(x)} d(x, p_i) \;-\; \min_{1 \le i \le n:\ \tau_i = h(x)} d(x, p_i). \qquad (2.4)$$
In words, the width of $h$ at $x$ is the difference between the distance from $x$ to its nearest unlike prototype and the distance from $x$ to its nearest prototype, where "unlike" means labeled differently from $h(x)$. In [10] we consider binary classifiers that are based on a pair of oppositely labeled prototypes and use this definition of width, which in that case becomes simply $d(x, p^-) - d(x, p^+)$.

The signed width (or margin) function corresponding to (2.4) is defined as
$$f_{R,\tau}(x) := f_{h_{R,\tau}}(x) = h_{R,\tau}(x)\, w_{h_{R,\tau}}(x). \qquad (2.5)$$
Note that this definition means that for $x$ equidistant from two oppositely labeled prototypes $p, q \in R$ that are each the closest to $x$ among all prototypes of the same label, the value of the margin $f_{R,\tau}(x)$ at this $x$ is zero. This definition is intuitive and actually makes the analysis simpler compared to an alternative definition of width [7]. This definition of width is an application of a more general definition of width, introduced in [5, 6], which takes the form $f(x) = d(x, S^-) - d(x, S^+)$, where $S^-$ and $S^+$ are any disjoint subsets of the input space that are labeled $-1$ and $1$, respectively. (In [7-9] a slightly different definition of width was used, where the union of the disjoint sets $S^-$ and $S^+$ equals the input space.) In [3] we considered binary classifiers which are also based on prototypes, where the decision is not based on the nearest prototype but on the combined influence of several prototypes through certain regions of influence. The present notion of width was not explicitly utilized there.

For a positive margin parameter $\gamma > 0$ and a training sample, the empirical (sample) $\gamma$-margin error is defined as
$$\hat{P}_m\big(Y f_{R,\tau}(X) \le \gamma\big) = \frac{1}{m} \sum_{j=1}^m I\big(Y_j f_{R,\tau}(X_j) \le \gamma\big).$$
(Here, $I(A)$ is the indicator function of the set, or event, $A$.) Define the function
$$\mathrm{sgn}(a) := \begin{cases} 1 & \text{if } a > 0 \\ -1 & \text{if } a \le 0. \end{cases}$$
For the purpose of bounding the generalization error it is convenient to express the classification $h(x)$ in terms of the signed width as follows: $h_{R,\tau}(X) = \mathrm{sgn}(f_{R,\tau}(X))$. Therefore the generalization error $\mathrm{er}_P(h_{R,\tau})$ can be bounded as follows:
$$\mathrm{er}_P(h_{R,\tau}) = P(h_{R,\tau}(X) \ne Y) \qquad (2.6)$$
$$= P(Y f_{R,\tau}(X) < 0) + P(Y = 1,\ f_{R,\tau}(X) = 0) \;\le\; P(Y f_{R,\tau}(X) \le 0). \qquad (2.7)$$
Our aim is to show that the upper bound (2.7) on the generalization (misclassification) error is not much greater than $\hat{P}_m(Y f_{R,\tau}(X) \le \gamma)$. Explicitly, we aim for bounds of the following form: for all $\delta \in (0, 1)$, with probability at least $1 - \delta$, for all $\gamma \in (0, \mathrm{diam}(\mathcal{X})]$, we have
$$\mathrm{er}_P(h_{R,\tau}) \le \hat{P}_m\big(Y f_{R,\tau}(X) \le \gamma\big) + \epsilon(m, \gamma, \delta).$$
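To make the width and margin quantities concrete, here is a small sketch (our own illustration, with hypothetical helper names) that computes the signed width (2.5) and the empirical gamma-margin error for a labeled sample, again assuming the distance space is given as a matrix.

```python
import numpy as np

def signed_width(D, proto_idx, proto_labels, x_idx):
    """f_{R,tau}(x) = min_{j in N^-} d(x, p_j) - min_{i in N^+} d(x, p_i)."""
    dists = D[x_idx, proto_idx]
    d_neg = dists[proto_labels == -1].min()   # distance to nearest negative prototype
    d_pos = dists[proto_labels == 1].min()    # distance to nearest positive prototype
    return d_neg - d_pos

def empirical_margin_error(D, proto_idx, proto_labels, sample_idx, sample_y, gamma):
    """(1/m) * sum_j I{ Y_j * f_{R,tau}(X_j) <= gamma }."""
    f = np.array([signed_width(D, proto_idx, proto_labels, j) for j in sample_idx])
    return float(np.mean(sample_y * f <= gamma))

# Example with the 4-point space from before, prototypes x_1 (label +1) and x_4 (label -1).
D = np.array([[0.0, 0.3, 0.9, 1.0],
              [0.3, 0.0, 0.6, 0.8],
              [0.9, 0.6, 0.0, 0.2],
              [1.0, 0.8, 0.2, 0.0]])
R, tau = np.array([0, 3]), np.array([1, -1])
sample_idx = np.array([1, 2])     # sample points X_1 = x_2, X_2 = x_3
sample_y = np.array([1, -1])      # their observed labels
print(empirical_margin_error(D, R, tau, sample_idx, sample_y, gamma=0.3))
```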

This will imply that if the learner finds a hypothesis which, for a large value of $\gamma$, has a small $\gamma$-margin error, then that hypothesis is likely to have small error.

The advantage of working with the notion of width is that it is possible to have such a uniform bound over a very large family of classifiers. For instance, in [7] we obtained such bounds for learning the family of all possible binary classifiers on any finite metric space, and in [8] we did the same for multi-category classifiers over infinite metric spaces. As mentioned above, in the current work we consider particular kinds of classifiers $h_{R,\tau}$ that are defined by the nearest-prototype rule based on a fixed number $n$ of prototypes. Thus we expect the bound that we obtain to be tighter than the one in [7], which holds for the family of all binary classifiers.

To obtain a uniform bound, we are interested in showing that the probability of the bad event, namely that there exists some value of $\gamma$ and some classifier $h_{R,\tau}$ such that the generalization error is not bounded from above by some small deviation from the empirical $\gamma$-margin error, is small. That is, we aim to bound the following failure probability:
$$P_{X,Y}^m\left( \exists \gamma,\ \exists \tau,\ \exists R:\ P(Y f_{R,\tau}(X) \le 0) > \frac{1}{m}\sum_{j=1}^m I\{Y_j f_{R,\tau}(X_j) \le \gamma\} + \epsilon \right). \qquad (2.8)$$
This can be expressed as follows:
$$P_{X,Y}^m\left( \exists \gamma,\ \exists \tau,\ \exists R:\ P(Y f_{R,\tau}(X) > 0) < \frac{1}{m}\sum_{j=1}^m I\{Y_j f_{R,\tau}(X_j) > \gamma\} - \epsilon \right). \qquad (2.9)$$
Let us fix $\gamma$ for now, and deal with bounding the probability
$$P_{X,Y}^m\left( \exists \tau,\ \exists R:\ P(Y f_{R,\tau}(X) > 0) < \frac{1}{m}\sum_{j=1}^m I\{Y_j f_{R,\tau}(X_j) > \gamma\} - \epsilon \right). \qquad (2.10)$$

3 Towards bounding the probability

3.1 Representing the bad event by related sets

Define the set $M_{R,\tau,\gamma} \subseteq \mathcal{X} \times \mathcal{Y}$ as follows,
$$M_{R,\tau,\gamma} := \{(x, y) : y f_{R,\tau}(x) > \gamma\},$$
and let
$$M^+_{R,\tau,\gamma} := \{(x, 1) : f_{R,\tau}(x) > \gamma\} \qquad (3.1)$$
$$M^-_{R,\tau,\gamma} := \{(x, -1) : f_{R,\tau}(x) < -\gamma\}. \qquad (3.2)$$

Note that
$$M_{R,\tau,\gamma} = \Big( M_{R,\tau,\gamma} \cap \{(x,y) : y = 1\} \Big) \cup \Big( M_{R,\tau,\gamma} \cap \{(x,y) : y = -1\} \Big) = M^+_{R,\tau,\gamma} \cup M^-_{R,\tau,\gamma}. \qquad (3.3)$$
We can write
$$P(Y f_{R,\tau}(X) > \gamma) = P(M_{R,\tau,\gamma})$$
and
$$\frac{1}{m}\sum_{j=1}^m I\{Y_j f_{R,\tau}(X_j) > \gamma\} = P_m(M_{R,\tau,\gamma}),$$
where $P_m$ denotes the empirical measure based on a sample of length $m$. Thus (2.9) is expressed as
$$P_{X,Y}^m\big( \exists \gamma,\ \exists \tau,\ \exists R:\ P(M_{R,\tau,0}) < P_m(M_{R,\tau,\gamma}) - \epsilon \big).$$
Let $\epsilon(\gamma)$ be any function depending on $\gamma$, in a way to be specified later, and define the set $E(\gamma) \subseteq (\mathcal{X} \times \mathcal{Y})^m$ of samples for which
$$\exists \tau,\ \exists R:\quad P(M_{R,\tau,0}) < P_m(M_{R,\tau,\gamma}) - \epsilon(\gamma). \qquad (3.4)$$
Then substituting $\epsilon(\gamma)$ for $\epsilon$ in (2.10) implies that (2.10) equals the probability $P_{X,Y}^m(E(\gamma))$. It follows that (2.8) equals
$$P_{X,Y}^m\left( \bigcup_{\gamma \in (0,\, \mathrm{diam}(\mathcal{X})]} E(\gamma) \right). \qquad (3.5)$$
Let $L$ be an integer (to be specified in a later section). For integers $0 \le l \le L+1$, let $\gamma_l$ be a decreasing sequence such that the following conditions hold:

1. $0 \le \gamma_l \le \gamma_0 = 1$ and $\gamma_{L+1} = 0$.

Denote by
$$C := \sum_{l=1}^L \gamma_l.$$
While all the above quantities $L$, $\gamma_l$ and $C$ may depend on $\mathcal{X}$, we keep this dependence implicit in the notation.

Define $\Gamma_l := (\gamma_l, \gamma_{l-1}]$ for $1 \le l \le L+1$. Then (3.5) equals
$$P_{X,Y}^m\left( \bigcup_{l=1}^{L+1} \bigcup_{\gamma \in \Gamma_l} E(\gamma) \right) \le \sum_{l=1}^{L+1} P_{X,Y}^m\left( \bigcup_{\gamma \in \Gamma_l} E(\gamma) \right). \qquad (3.6)$$
Define the set $E_l \subseteq (\mathcal{X} \times \mathcal{Y})^m$ of samples for which
$$\exists \tau,\ \exists R:\quad P(M_{R,\tau,\gamma_l}) < P_m(M_{R,\tau,\gamma_l}) - \epsilon(\gamma_{l-1}).$$
Henceforth, assume that $\epsilon(\gamma)$ is a non-increasing function over each interval $\Gamma_l$.

Proposition 1. For any $\gamma \in \Gamma_l$, $E(\gamma) \subseteq E_l$.

Proof. We have $M_{R,\tau,0} \supseteq M_{R,\tau,\gamma_l}$; thus $P(M_{R,\tau,0}) \ge P(M_{R,\tau,\gamma_l})$. And $M_{R,\tau,\gamma_l} \supseteq M_{R,\tau,\gamma}$ since $\gamma_l \le \gamma$; therefore $P_m(M_{R,\tau,\gamma}) \le P_m(M_{R,\tau,\gamma_l})$. For $\gamma \le \gamma_{l-1}$, by the above assumption on $\epsilon$, $\epsilon(\gamma) \ge \epsilon(\gamma_{l-1})$. It follows that $E(\gamma) \subseteq E_l$.

The event that there exist $\tau$ and $R$ such that $P(M_{R,\tau,\gamma_l}) < P_m(M_{R,\tau,\gamma_l}) - \epsilon(\gamma_{l-1})$ holds, together with (3.3), implies that at least one of the following events occurs: there exist $\tau$ and $R$ such that $P(M^+_{R,\tau,\gamma_l}) < P_m(M^+_{R,\tau,\gamma_l}) - \epsilon(\gamma_{l-1})/2$, or there exist $\tau$ and $R$ such that $P(M^-_{R,\tau,\gamma_l}) < P_m(M^-_{R,\tau,\gamma_l}) - \epsilon(\gamma_{l-1})/2$. Let $E_l^+$ be the set of samples for which
$$\exists \tau,\ \exists R:\quad P(M^+_{R,\tau,\gamma_l}) < P_m(M^+_{R,\tau,\gamma_l}) - \epsilon(\gamma_{l-1})/2 \qquad (3.7)$$
and $E_l^-$ the set of samples for which
$$\exists \tau,\ \exists R:\quad P(M^-_{R,\tau,\gamma_l}) < P_m(M^-_{R,\tau,\gamma_l}) - \epsilon(\gamma_{l-1})/2. \qquad (3.8)$$
Then
$$P_{X,Y}^m(E_l) \le P_{X,Y}^m(E_l^+) + P_{X,Y}^m(E_l^-).$$

3.2 Bounding the probability in terms of the growth function

We now aim to bound from above the first probability, $P_{X,Y}^m(E_l^+)$. We first briefly recall the definitions of the growth function and the VC-dimension [28]. Suppose that $\mathcal{C}$ is a collection of subsets of a set $Z$. Let $S$ be any (finite) subset of $Z$. Then a dichotomy of $S$ by $\mathcal{C}$ is a set of the form $S \cap C$ where $C \in \mathcal{C}$. We denote the number of dichotomies of $S$ by $\mathcal{C}$ as $\#(\mathcal{C}; S)$. Thus,
$$\#(\mathcal{C}; S) = |\{S \cap C : C \in \mathcal{C}\}|.$$
The growth function of $\mathcal{C}$ is the function $\Pi_{\mathcal{C}} : \mathbb{N} \to \mathbb{N}$ defined as follows: for $m \in \mathbb{N}$,
$$\Pi_{\mathcal{C}}(m) = \max\{\#(\mathcal{C}; S) : S \subseteq Z,\ |S| = m\}.$$
The VC-dimension of $\mathcal{C}$ is (infinity, or) the largest value of $m$ such that $\Pi_{\mathcal{C}}(m) = 2^m$. (A set $S$ of size $m$ such that $\#(\mathcal{C}; S) = 2^m$ is said to be shattered by $\mathcal{C}$.)

Define the classes
$$\mathcal{M}^+_{\gamma_l} := \{ M^+_{R,\tau,\gamma_l} : \tau \in \mathcal{Y}^n,\ R \subseteq \mathcal{X} \}, \qquad \mathcal{M}^-_{\gamma_l} := \{ M^-_{R,\tau,\gamma_l} : \tau \in \mathcal{Y}^n,\ R \subseteq \mathcal{X} \}.$$
Denote by $\Pi_{\mathcal{M}^+_{\gamma_l}}(m)$ the growth function of the class $\mathcal{M}^+_{\gamma_l}$. By [28] (see also Theorem 4.3 of [1]), it follows that
$$P_{X,Y}^m(E_l^+) \le 4\, \Pi_{\mathcal{M}^+_{\gamma_l}}(2m)\, \exp\!\big(-m\,\epsilon^2(\gamma_{l-1})/32\big)$$
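The growth function is easy to compute by brute force on small examples. The sketch below (illustrative only) counts dichotomies #(C; S) of a set system given as a 0/1 membership matrix; the columns of sgn(F_gamma) in the next section play the same role, with +1 marking membership.

```python
import numpy as np
from itertools import combinations

def num_dichotomies(M, rows):
    """#(C; S): number of distinct patterns the column sets induce on the points in S.

    M    -- binary matrix; column c is the indicator vector of one set C of the class
    rows -- indices of the points of S
    """
    patterns = {tuple(M[rows, c]) for c in range(M.shape[1])}
    return len(patterns)

def growth_function(M, m):
    """Pi_C(m) = max over |S| = m of #(C; S) (brute force, small N only)."""
    N = M.shape[0]
    return max(num_dichotomies(M, list(S)) for S in combinations(range(N), m))

# Example: three subsets of a 4-point set, given by indicator columns.
M = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 0],
              [0, 0, 1]])
print(growth_function(M, 2))   # largest number of dichotomies on any pair of points
```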

and
$$P_{X,Y}^m(E_l^-) \le 4\, \Pi_{\mathcal{M}^-_{\gamma_l}}(2m)\, \exp\!\big(-m\,\epsilon^2(\gamma_{l-1})/32\big).$$
Define $G(m, \gamma)$ to be an upper bound on $\ln \Pi_{\mathcal{M}^+_{\gamma}}(2m)$ and $\ln \Pi_{\mathcal{M}^-_{\gamma}}(2m)$, to be specified later, such that choosing
$$\epsilon(\gamma) := \sqrt{\frac{32}{m}\left( G(m,\gamma) + \ln\frac{8(C+1)}{\delta\gamma} \right)} \qquad (3.9)$$
makes the inequality $\epsilon(\gamma) \ge \epsilon(\gamma_{l-1})$ hold for all $\gamma \in \Gamma_l$, as required for Proposition 1 and for the definition of $\epsilon(\gamma)$ in (3.4). Then, substituting $\gamma_{l-1}$ for $\gamma$ in (3.9) and letting (3.9) be the choice for $\epsilon(\gamma_{l-1})$ in (3.7), it follows from Theorems 3.7 and 4.3 of [1] that both $P(E_l^+)$ and $P(E_l^-)$ are bounded from above by $\delta\gamma_{l-1}/(2(C+1))$.

From Proposition 1, it follows that (3.6) is bounded as follows:
$$\sum_{l=1}^{L+1} P_{X,Y}^m\left( \bigcup_{\gamma \in \Gamma_l} E(\gamma) \right) \le \sum_{l=1}^{L+1} P_{X,Y}^m(E_l) \qquad (3.10)$$
$$\le \sum_{l=1}^{L+1} P_{X,Y}^m(E_l^+) + \sum_{l=1}^{L+1} P_{X,Y}^m(E_l^-) \qquad (3.11)$$
$$\le 2 \sum_{l=1}^{L+1} \frac{\delta\gamma_{l-1}}{2(C+1)} = \frac{\delta}{C+1}\sum_{l=1}^{L+1}\gamma_{l-1} = \frac{\delta}{C+1}\sum_{l=0}^{L}\gamma_l = \frac{\delta}{C+1}\left(1 + \sum_{l=1}^{L}\gamma_l\right) = \delta. \qquad (3.12)$$
In the next section, we derive a value of $G(m, \gamma)$ which bounds from above the logarithm of the growth functions of $\mathcal{M}^+_\gamma$ and $\mathcal{M}^-_\gamma$.

4 Bounding the growth function

In this section we bound the growth functions of the classes $\mathcal{M}^+_\gamma$ and $\mathcal{M}^-_\gamma$.

4.1 Half-spaces of $\mathcal{X}$

Define $\mathrm{ind}(p_i) \in [N]$ to be the index of the point $x$ such that $p_i = x_{\mathrm{ind}(p_i)}$; that is, $\mathrm{ind}(p_i)$ is the index of a prototype with respect to the pre-determined ordering of the distance space $\mathcal{X}$. Clearly, each $p_i$ has a unique $\mathrm{ind}(p_i)$ value.

For any $i, j \in [N]$, define an affined half-space set as follows:
$$W^{(i,j)}_\gamma := \{x : d(x, x_j) - d(x, x_i) > \gamma\}. \qquad (4.1)$$
Let the class of such sets be
$$\mathcal{W}_\gamma := \{ W^{(i,j)}_\gamma : 1 \le i, j \le N,\ j \ne i \}.$$

4.2 Matrix representation

Recall that $\mathcal{X} = \{x_1, \ldots, x_N\}$. Since a prototype may be any point in $\mathcal{X}$, then, in general, for any pair of prototypes $p, q \in \mathcal{X}$ there are some $1 \le i \ne j \le N$ such that $p = x_i$, $q = x_j$. Write $W^{(p,q)}_0 := W^{(i,j)}_0$. Then $W^{(i,j)}_0$ corresponds to the positive elements of the following vector:
$$F^{(i)}_j := \begin{bmatrix} d(x_1, x_j) - d(x_1, x_i) \\ \vdots \\ d(x_N, x_j) - d(x_N, x_i) \end{bmatrix}. \qquad (4.2)$$
Note that taking the sign of a vector $F^{(i)}_j$ yields a partition of $\mathcal{X}$ into two parts, referred to as half-spaces. For a real vector $v \in \mathbb{R}^N$ define $\mathrm{sgn}(v) = [\mathrm{sgn}(v_1), \ldots, \mathrm{sgn}(v_N)]$. Hence $\mathrm{sgn}(F^{(i)}_j)$ corresponds to a half-space on $\mathcal{X}$. Fix any point $x_i \in \mathcal{X}$ and let the $N \times (N-1)$ matrix $F^{(i)}$ be defined by
$$F^{(i)} = \left[ F^{(i)}_1, \ldots, F^{(i)}_{i-1}, F^{(i)}_{i+1}, \ldots, F^{(i)}_N \right], \qquad (4.3)$$
where the $j$th column is $F^{(i)}_j$, $j \ne i$, $1 \le j \le N$. Define the $N \times N(N-1)$ matrix
$$F := \left[ F^{(1)}, \ldots, F^{(N)} \right]. \qquad (4.4)$$
The binary matrix
$$\mathrm{sgn}(F) := \left[ \mathrm{sgn}(F^{(1)}), \ldots, \mathrm{sgn}(F^{(N)}) \right], \quad \text{where} \quad \mathrm{sgn}(F^{(i)}) := \left[ \mathrm{sgn}(F^{(i)}_1), \ldots, \mathrm{sgn}(F^{(i)}_N) \right],$$
represents the class of all half-spaces on $\mathcal{X}$.

4.3 Thresholding by $\gamma$

The set $W^{(i,j)}_0$ corresponds to some column of the matrix $F$. We now define a more general matrix whose columns correspond to the sets $W^{(i,j)}_\gamma$ defined in (4.1), for any fixed $\gamma > 0$. Bounding the VC-dimension of this matrix means that we obtain a bound on the VC-dimension of the class $\mathcal{W}_\gamma$. (By the VC-dimension of a binary matrix, we mean that of the set system in which the indicator functions of the sets correspond to the columns of the matrix, with a 1-entry denoting inclusion in the set.)
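A minimal sketch (our own, not code from the paper) of the construction in (4.2)-(4.4): build F from the distance matrix and threshold it by gamma to obtain the sign matrix used below.

```python
import numpy as np

def build_F(D):
    """Return the N x N(N-1) matrix F = [F^(1), ..., F^(N)] of (4.4).

    The columns are the vectors F^(i)_j with entries d(x_k, x_j) - d(x_k, x_i),
    for all ordered pairs j != i, in the order i = 1..N, j = 1..N (j != i).
    """
    N = D.shape[0]
    cols = []
    for i in range(N):
        for j in range(N):
            if j != i:
                cols.append(D[:, j] - D[:, i])   # d(x_k, x_j) - d(x_k, x_i), k = 1..N
    return np.column_stack(cols)

def sign_matrix(F, gamma=0.0):
    """sgn(F - gamma * J): +1 where the thresholded entry is positive, -1 otherwise."""
    return np.where(F - gamma > 0, 1, -1)

D = np.array([[0.0, 0.3, 0.9, 1.0],
              [0.3, 0.0, 0.6, 0.8],
              [0.9, 0.6, 0.0, 0.2],
              [1.0, 0.8, 0.2, 0.0]])
F = build_F(D)
print(F.shape)                       # (4, 12) = (N, N(N-1))
print(sign_matrix(F, 0.25)[:, 0])    # one thresholded half-space as a +/-1 column
```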

For any $1 \le i \ne j \le N$, a set $W^{(i,j)}_\gamma$ corresponds to the positive elements of the vector $F^{(i)}_j - \gamma\mathbf{1}$, where $\mathbf{1}$ is an $N \times 1$ vector of all ones. Denote by $J$ an $N \times N(N-1)$ matrix of all ones. For any $\gamma > 0$, let us consider the $N \times N(N-1)$ matrix
$$F_\gamma := F - \gamma J = \left[ F^{(1)}_2 - \gamma\mathbf{1}, \ldots, F^{(N)}_{N-1} - \gamma\mathbf{1} \right]. \qquad (4.5)$$
The matrix $F_\gamma$ corresponds to the class $\mathcal{W}_\gamma$ of sets, where for the column $F^{(i)}_j - \gamma\mathbf{1}$ the positive elements of the vector correspond to the elements of the set $W^{(i,j)}_\gamma$. The binary matrix $\mathrm{sgn}(F_\gamma)$ corresponds to a class of $\gamma$-affined half-spaces on $\mathcal{X}$ (the columns of $\mathrm{sgn}(F_\gamma)$).

We now choose the constant $L$ of Section 3 to be the number of distinct positive entries of $F$, denoted $0 < a_L < a_{L-1} < \cdots < a_1 < 1$, where $a_1$ and $a_L$ are the maximum and minimum positive entries of $F$, respectively. For $1 \le i \le L$, define the multiplicity of $a_i$, denoted by $m_i$, as the number of times that $a_i$ appears in $F$. We refer to $S_F = \{a_1, a_2, \ldots, a_L\}$ as the positive set of $F$, and we set $a_{L+1} = 0$, $a_0 = 1$. We henceforth choose for $\gamma_l$ (defined in Section 3) the value $\gamma_l := a_l$; thus we have $\Gamma_l := (a_l, a_{l-1}]$, $1 \le l \le L$, and $\Gamma_{L+1} := [0, a_L]$.

From [10] we have the following bound on the VC-dimension of $\mathrm{sgn}(F_\gamma)$:
$$\mathrm{VC}(\mathrm{sgn}(F_\gamma)) \le w(\gamma), \qquad (4.6)$$
where
$$w(\gamma) := w_{l-1}, \quad \text{for } \gamma \in \Gamma_l,\ 1 \le l \le L+1, \qquad (4.7)$$
is a non-increasing step function taking the constant values
$$w_l := \log_2\big(\psi(t_l) + 1\big),$$
where the $\psi(t_l)$ (defined further below in Section 5) are based on the multiplicity values $m_l$ of the positive entries $a_l$. The value of $t_0$ is $0$ and $\psi(0) = 0$, so $w_0 = 0$.

Let $T \subseteq \mathcal{X}$ and denote by $(F_\gamma)_T$ the sub-matrix of $F_\gamma$ restricted to the rows that correspond to the elements of $T$. Sauer's Lemma (see, for instance, Theorem 3.6 in [1]) implies that the number of distinct columns of $\mathrm{sgn}((F_\gamma)_T)$, denoted by $|\mathrm{sgn}((F_\gamma)_T)|$, is bounded as follows:
$$|\mathrm{sgn}((F_\gamma)_T)| \le \sum_{i=0}^{w(\gamma)} \binom{|T|}{i} \le \left( \frac{e|T|}{w(\gamma)} \right)^{w(\gamma)}.$$
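The quantities that drive the bound, namely the positive set S_F = {a_1, ..., a_L}, the multiplicities m_l, and the induced intervals Gamma_l, can be read off F directly. The following sketch is our own illustration; the one-line build_F is the same construction as in the previous sketch.

```python
import numpy as np

def build_F(D):
    """Columns F^(i)_j = D[:, j] - D[:, i] for all ordered pairs j != i (see (4.2)-(4.4))."""
    N = D.shape[0]
    return np.column_stack([D[:, j] - D[:, i] for i in range(N) for j in range(N) if j != i])

def positive_set(F, tol=1e-12):
    """Distinct positive entries a_1 > ... > a_L of F and their multiplicities m_1, ..., m_L."""
    pos = np.round(F[F > tol], 12)        # rounding guards against floating-point ties
    vals, counts = np.unique(pos, return_counts=True)
    order = np.argsort(vals)[::-1]        # sort decreasingly: a_1 is the largest
    return vals[order], counts[order]

def gamma_interval(gamma, a):
    """Index k with gamma in Gamma_k, where Gamma_k = (a_k, a_{k-1}] (a_0 = 1) and Gamma_{L+1} = [0, a_L]."""
    a_ext = np.concatenate(([1.0], a))
    for k in range(1, len(a) + 1):
        if a_ext[k] < gamma <= a_ext[k - 1]:
            return k
    return len(a) + 1

D = np.array([[0.0, 0.3, 0.9, 1.0],
              [0.3, 0.0, 0.6, 0.8],
              [0.9, 0.6, 0.0, 0.2],
              [1.0, 0.8, 0.2, 0.0]])
a, mult = positive_set(build_F(D))
print(a)                        # the positive set S_F, largest entry first
print(mult)                     # multiplicities of the a_l in F
print(gamma_interval(0.35, a))  # which interval Gamma_k contains gamma = 0.35
```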

Since $F_\gamma$ corresponds to the class $\mathcal{W}_\gamma$, the number of dichotomies of the class of sets $\mathcal{W}_\gamma$ on $T$ is bounded as follows:
$$\#(\mathcal{W}_\gamma; T) \le \left( \frac{e|T|}{w(\gamma)} \right)^{w(\gamma)}. \qquad (4.8)$$
Because $w(\gamma)$ is a step function over the intervals $\Gamma_l$, the right side of (4.8) is also a step function over these intervals, and it suffices to derive its values at the interval boundaries $a_l$. Note that $a_l \in \Gamma_{l+1}$, so for $\gamma = a_l$, $1 \le l \le L+1$,
$$\#(\mathcal{W}_{a_l}; T) \le \left( \frac{e|T|}{w_l} \right)^{w_l}. \qquad (4.9)$$
For $l = 0$, since $w_0 = 0$, we have $\#(\mathcal{W}_{a_0}; T) = 1$.

4.4 Bounding the growth functions of $\mathcal{M}^+_{a_l}$ and $\mathcal{M}^-_{a_l}$

We bound the growth function of the class $\mathcal{M}^+_{a_l}$. We first fix the prototype-label vector $\tau$ and let $R$ run over all possible $n$-prototype sets. We denote by $\mathcal{M}^+_{\tau,a_l}$ the corresponding class of sets $M^+_{R,\tau,a_l}$, $R \subseteq \mathcal{X}$.

Proposition 2. For any fixed $\tau \in \mathcal{Y}^n$ and any subset $S \subseteq \mathcal{X} \times \{1\}$, the number of dichotomies $\#(\mathcal{M}^+_{\tau,a_l}; S)$ obtained by $\mathcal{M}^+_{\tau,a_l}$ on $S$ is bounded as follows:
$$\#(\mathcal{M}^+_{\tau,a_l}; S) \le \#(\mathcal{W}_{a_l}; S^+)^{N^+(\tau) N^-(\tau)},$$
where $S^+ := \{x \in \mathcal{X} : (x, 1) \in S\}$.

Proof. The number of dichotomies that the class $\mathcal{M}^+_{\tau,a_l}$ of sets $M^+_{R,\tau,a_l}$ obtains on $S$ is the same as the number of dichotomies that the class $\mathcal{V}^+_{\tau,a_l}$ of sets
$$V^+_{R,\tau,a_l} := \{x : (x,1) \in M^+_{R,\tau,a_l}\}$$
obtains on $S^+$. Hence it suffices to find an upper bound on $\#(\mathcal{V}^+_{\tau,a_l}; S^+)$. Fix any set of prototypes $R \subseteq \mathcal{X}$ of cardinality $n$. We have
$$V^+_{R,\tau,a_l} = \left\{ x : \min_{j \in N^-(\tau)} d(x, p_j) - \min_{i \in N^+(\tau)} d(x, p_i) > a_l \right\}$$
$$= \{ x : \exists i \in N^+(\tau),\ \forall j \in N^-(\tau),\ d(x, p_j) - d(x, p_i) > a_l \}$$
$$= \bigcup_{i \in N^+(\tau)} \bigcap_{j \in N^-(\tau)} \{x : d(x, p_j) - d(x, p_i) > a_l\} = \bigcup_{i \in N^+(\tau)} \bigcap_{j \in N^-(\tau)} W^{(\mathrm{ind}(p_i),\, \mathrm{ind}(p_j))}_{a_l}.$$
Define the $N^-(\tau)$-fold intersection set
$$A^{(R)}_i := \bigcap_{j \in N^-(\tau)} W^{(\mathrm{ind}(p_i),\, \mathrm{ind}(p_j))}_{a_l}$$
and the class of such sets by
$$\mathcal{A}_R := \left\{ A^{(R)}_i : i \in N^+(\tau) \right\}.$$

From Theorem 13.5(iii) of [17] it follows that
$$\#(\mathcal{A}_R; S^+) \le \#(\mathcal{W}_{a_l}; S^+)^{N^-(\tau)}.$$
Now, let
$$B_R := \bigcup_{i \in N^+(\tau)} A^{(R)}_i$$
and denote the class of all such $N^+(\tau)$-fold unions by
$$\mathcal{B} := \{B_R : R \subseteq \mathcal{X}\}.$$
From Theorem 13.5(iv) of [17] it follows that
$$\#(\mathcal{B}; S^+) \le \#(\mathcal{A}_R; S^+)^{N^+(\tau)}.$$
Hence we have
$$\#(\mathcal{B}; S^+) \le \#(\mathcal{W}_{a_l}; S^+)^{N^-(\tau) N^+(\tau)}.$$
Finally, we have $V^+_{R,\tau,a_l} = B_R$ and $\mathcal{V}^+_{\tau,a_l} = \mathcal{B}$. Combining all of the above, we obtain
$$\#(\mathcal{M}^+_{\tau,a_l}; S) = \#(\mathcal{V}^+_{\tau,a_l}; S^+) = \#(\mathcal{B}; S^+) \le \#(\mathcal{W}_{a_l}; S^+)^{N^-(\tau) N^+(\tau)}.$$

Denote the set of dichotomies of $S$ by all sets $M^+_{R,\tau,a_l}$ as
$$U_{R,\tau,a_l}(S) := \left\{ S \cap M^+_{R,\tau,a_l} : M^+_{R,\tau,a_l} \in \mathcal{M}^+_{\tau,a_l} \right\}.$$
Letting $\tau$ be unfixed, the number of dichotomies obtained by $\mathcal{M}^+_{a_l}$ satisfies
$$\left| \bigcup_{\tau \in \mathcal{Y}^n} \{ U_{R,\tau,a_l}(S) : R \subseteq \mathcal{X},\ |R| = n \} \right| \le \sum_{\tau \in \mathcal{Y}^n} \left| \{ U_{R,\tau,a_l}(S) : R \subseteq \mathcal{X},\ |R| = n \} \right| \qquad (4.10)$$
$$\le \sum_{\tau \in \mathcal{Y}^n} \#(\mathcal{W}_{a_l}; S^+)^{N^+(\tau) N^-(\tau)} \qquad (4.11)$$
$$= \sum_{k=1}^{n-1} \binom{n}{k} \#(\mathcal{W}_{a_l}; S^+)^{k(n-k)}, \qquad (4.12)$$
where (4.11) follows from Proposition 2.

Remark 3. The expression $2^n \big(\#(\mathcal{W}_{a_l}; S^+)\big)^{n(n-1)}$ is a simple yet trivial bound on (4.11).

We now obtain a much tighter bound than the one in Remark 3.

Proposition 4. Define $w$ by $w := \#(\mathcal{W}_{a_l}; S^+)$. Then the following bound holds:
$$\sum_{k=1}^{n-1} \binom{n}{k} w^{k(n-k)} \le w^{n^2/4}\, n \binom{n}{\lfloor n/2 \rfloor}. \qquad (4.13)$$
Proof. We have that $k(n-k)$ has a maximum at $k = n/2$ and therefore $k(n-k) \le n^2/4$. Furthermore, for all $k$,
$$\binom{n}{k} \le \binom{n}{\lfloor n/2 \rfloor}.$$
Hence
$$\sum_{k=1}^{n-1} \binom{n}{k} w^{k(n-k)} \le n \binom{n}{\lfloor n/2 \rfloor} w^{n^2/4}.$$

From (4.12) and Proposition 4, it follows that for any $S \subseteq \mathcal{X} \times \{1\}$ of cardinality $m$,
$$\#(\mathcal{M}^+_{a_l}; S) \le w^{n^2/4}\, n \binom{n}{\lfloor n/2 \rfloor}. \qquad (4.14)$$
We now consider the class $\mathcal{M}^-_{a_l}$.

Proposition 5. For any fixed $\tau \in \mathcal{Y}^n$ and any subset $S \subseteq \mathcal{X} \times \{-1\}$, the number of dichotomies $\#(\mathcal{M}^-_{\tau,a_l}; S)$ obtained by $\mathcal{M}^-_{\tau,a_l}$ on $S$ is bounded from above by the right side of (4.14).

Proof. We follow the proof of Proposition 2 and, instead of the class $\mathcal{M}^+_{\tau,a_l}$ and a set $M^+_{R,\tau,a_l}$, we consider the class $\mathcal{M}^-_{\tau,a_l}$ and a set $M^-_{R,\tau,a_l}$. By definition this set equals
$$M^-_{R,\tau,a_l} = \left\{ (x, -1) : \min_{j \in N^-(\tau)} d(x, p_j) - \min_{i \in N^+(\tau)} d(x, p_i) < -a_l \right\}$$
and can be written as
$$M^-_{R,\tau,a_l} = \left\{ (x, -1) : \min_{i \in N^+(\tau)} d(x, p_i) - \min_{j \in N^-(\tau)} d(x, p_j) > a_l \right\}. \qquad (4.15)$$

Thus, as in the proof of Proposition 2, the set $M^-_{R,\tau,a_l}$ corresponds to a set $V^-_{R,\tau,a_l}$ which is defined as
$$V^-_{R,\tau,a_l} = \left\{ x : \min_{i \in N^+(\tau)} d(x, p_i) - \min_{j \in N^-(\tau)} d(x, p_j) > a_l \right\}$$
$$= \{ x : \exists j \in N^-(\tau),\ \forall i \in N^+(\tau),\ d(x, p_i) - d(x, p_j) > a_l \}$$
$$= \bigcup_{j \in N^-(\tau)} \bigcap_{i \in N^+(\tau)} \{x : d(x, p_i) - d(x, p_j) > a_l\} = \bigcup_{j \in N^-(\tau)} \bigcap_{i \in N^+(\tau)} W^{(\mathrm{ind}(p_j),\, \mathrm{ind}(p_i))}_{a_l}.$$
From here the proof proceeds as the proof of Proposition 2, just swapping the indices $i$ with $j$, the sets $N^-$ with $N^+$, and the sets $S^+$ with $S^-$.

4.5 Finalizing

In the previous section we derived an upper bound on the number of dichotomies on any set of cardinality $m$ obtained by $\mathcal{M}^+_{\tau,a_l}$ or $\mathcal{M}^-_{\tau,a_l}$ in terms of the number of dichotomies by the classes $\mathcal{W}_{a_l}$. For $T \subseteq \mathcal{X}$, let $w(T) := \#(\mathcal{W}_{a_l}; T)$ and define the growth function of $\mathcal{W}_{a_l}$ as
$$w_m := \max_{T :\, |T| = m} w(T).$$
From (4.14), and from the fact that $\#(\mathcal{W}_{a_l}; S^+) \le w_m$ whenever $|S^+| \le m$, it follows that the growth function of $\mathcal{M}^+_{a_l}$ is bounded as follows:
$$\Pi_{\mathcal{M}^+_{a_l}}(m) \le w_m^{n^2/4}\, n \binom{n}{\lfloor n/2 \rfloor}.$$
Using a standard bound for the central binomial coefficient, we have
$$\binom{n}{\lfloor n/2 \rfloor} < \sqrt{\frac{2}{\pi n}}\, 2^n,$$
and so
$$\Pi_{\mathcal{M}^+_{a_l}}(m) \le w_m^{n^2/4}\, \sqrt{\frac{2n}{\pi}}\, 2^n. \qquad (4.16)$$
From (4.9) we have
$$w_m \le \left( \frac{em}{w(a_l)} \right)^{w(a_l)}$$
and therefore
$$\ln \Pi_{\mathcal{M}^+_{a_l}}(m) \le \frac{n^2}{4}\ln w_m + n\ln 2 + \frac{1}{2}\ln\frac{2n}{\pi} \le \frac{n^2 w(a_l)}{4}\ln\left( \frac{em}{w(a_l)} \right) + n\ln 2 + \frac{1}{2}\ln\frac{2n}{\pi}. \qquad (4.17)$$

From (4.7), we have $w(a_l) = w_l$. Therefore (4.17) equals
$$\frac{n^2 w_l}{4}\ln\left( \frac{em}{w_l} \right) + n\ln 2 + \frac{1}{2}\ln\frac{2n}{\pi}.$$
We now define $G(m, \gamma)$ of Section 3.2 as follows (recall that $G(m, \gamma)$ bounds the growth function evaluated at $2m$): for any $\gamma \in \Gamma_{l+1}$,
$$G(m, \gamma) := \frac{n^2 w_l}{4}\ln\left( \frac{2em}{w_l} \right) + n\ln 2 + \frac{1}{2}\ln\frac{2n}{\pi}.$$
Note that for all $\gamma \in \Gamma_l$, $G(m, \gamma) = G(m, a_{l-1})$, and the second term inside the square root in (3.9) satisfies $\ln(8(C+1)/\delta\gamma) \ge \ln(8(C+1)/\delta a_{l-1})$, so it follows that $\epsilon(\gamma) \ge \epsilon(a_{l-1})$; therefore the requirement on $\epsilon(\gamma)$ of Section 3.2, namely that it is non-decreasing as $\gamma$ decreases, is satisfied.

5 Main result

To obtain the main result, we draw on some results and notation from [10]. We define a few quantities, leaving the dependence on $N$ implicit. Let us denote by the $i$th shell of the binary $N$-dimensional cube $\{-1, 1\}^N$ (the $N$-cube) the set of all vertices that have $i$ components equal to $1$. Denote by
$$c_j := \binom{N}{j}$$
the number of vertices in the $j$th shell. Denote by
$$b_n := \sum_{j=1}^n c_j \qquad (5.1)$$
the number of vertices of the cube contained in the first $n$ shells, and let $b_0 := 0$. Let us define
$$\ell_n := \sum_{j=1}^n j\, c_j, \qquad (5.2)$$
the total weight (number of $1$-entries) in all of the vertices of the cube that are in the first $n$ shells. For a positive integer $m$ define
$$Q(m) := \min\{q : \ell_q \ge m\}. \qquad (5.3)$$
For instance, if $m = 17$ and $N = 4$, then $\ell_2 = 16 < 17 \le \ell_3 = 28$, so $Q(17) = 3$.
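The shell quantities c_j, b_n, ell_n and Q(m) are straightforward to compute; the sketch below (our own illustration) reproduces the worked example Q(17) = 3 for N = 4.

```python
from math import comb

def shell_quantities(N):
    """c_j = C(N, j), b_n = sum_{j<=n} c_j, ell_n = sum_{j<=n} j*c_j, for n = 0..N."""
    c = [comb(N, j) for j in range(N + 1)]                               # shells are j = 1..N
    b = [sum(c[1:n + 1]) for n in range(N + 1)]                          # b_0 = 0
    ell = [sum(j * c[j] for j in range(1, n + 1)) for n in range(N + 1)] # ell_0 = 0
    return c, b, ell

def Q(m, N):
    """Q(m) = min{ q : ell_q >= m } as in (5.3)."""
    _, _, ell = shell_quantities(N)
    return next(q for q in range(1, N + 1) if ell[q] >= m)

c, b, ell = shell_quantities(4)
print(ell)        # [0, 4, 16, 28, 32]
print(Q(17, 4))   # 3, matching the example in the text
```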

Let $\rho := m - \ell_{Q(m)-1}$, and then define $\lceil m \rceil$ as the following rounded value:
$$\lceil m \rceil := \begin{cases} m & \text{if } Q(m) = 1 \text{ or } \rho \bmod Q(m) = 0 \\ m + \big(Q(m) - \rho \bmod Q(m)\big) & \text{otherwise.} \end{cases}$$
So $\lceil m \rceil$ is the smallest integer greater than or equal to $m$ which is the total weight (number of $1$-entries) in a set of vertices of the cube, where that set is formed by first populating the first shell, then the second, and so on. In other words, $m$ $1$-entries might not be enough to form all the vertices of shells $1$ to $Q(m)-1$ and then some vertices in shell $Q(m)$: an additional (at most $Q(m)-1$) $1$-entries might be necessary to complete a vertex in shell $Q(m)$. For instance (continuing the above example), since $Q(17) = 3$ we have $\lceil 17 \rceil = 17 + (3 - 1 \bmod 3) = 19$, and indeed $19$ $1$-entries are required to form the vertices in shells $1$ and $2$ and the first vertex of the $3$rd shell.

For a positive integer $m$ let us define
$$\psi(m) := b_{Q(m)-1} + \frac{\lceil m \rceil - \ell_{Q(m)-1}}{Q(m)} \qquad (5.4)$$
and define $\psi(0) := 0$. Define the numbers $t_i$ as follows (where $m_i$ is the multiplicity of $a_i$, as defined earlier):
$$t_0 = 0, \qquad t_i = \lceil t_{i-1} + m_i \rceil, \quad 1 \le i \le L. \qquad (5.5)$$
Note that the $t_i$, $1 \le i \le L$, depend only on the $m_i$ and hence can be evaluated directly from the matrix $F$.

The following is the main result of the paper.

Theorem 6. Let $N \ge 1$ and $n \ge 2$ be fixed integers and let $\mathcal{X} = \{x_i\}_{i=1}^N$ be a finite distance space with a distance function $d(x_i, x_j)$, normalized such that $\mathrm{diam}(\mathcal{X}) = \max_{1 \le i,j \le N} d(x_i, x_j) = 1$. Let
$$f^{(i)}_j := \begin{bmatrix} d(x_1, x_j) - d(x_1, x_i) \\ \vdots \\ d(x_N, x_j) - d(x_N, x_i) \end{bmatrix},$$
and define the $N \times (N-1)$ matrix
$$F^{(i)} = \left[ f^{(i)}_1, \ldots, f^{(i)}_{i-1}, f^{(i)}_{i+1}, \ldots, f^{(i)}_N \right]$$
and the $N \times N(N-1)$ matrix
$$F := \left[ F^{(1)}, \ldots, F^{(N)} \right].$$
Let $0 = a_{L+1} < a_L < \cdots < a_1 < a_0 = 1$ be the values of the positive entries of $F$ and let $m_l \ge 1$ be the number of times that $a_l$ appears in $F$, $1 \le l \le L$. Define $\Gamma_l := (a_l, a_{l-1}]$, $1 \le l \le L$, $\Gamma_{L+1} := [0, a_L]$ and $C := \sum_{l=1}^L a_l$. For $1 \le l \le L$, let $w_l = \log_2(\psi(t_l) + 1)$.

Let $\mathcal{Y} = \{-1, 1\}$ and let $(R, \tau) \in (\mathcal{X} \times \mathcal{Y})^n$, $R := \{p_i\}_{i=1}^n$, denote any set of $n$ prototypes $p_i$, with $N^+(\tau)$ of them labeled $1$ and $N^-(\tau)$ labeled $-1$. Let $h_{R,\tau}$ be a nearest-prototype binary classifier, given by
$$h_{R,\tau}(x) = \begin{cases} 1 & \text{if } \arg\min_{1 \le i \le n} d(x, p_i) \in N^+(\tau) \\ -1 & \text{otherwise,} \end{cases} \qquad (5.6)$$
and define its signed width function $f_{R,\tau}$ as
$$f_{R,\tau}(x) = \min_{j \in N^-(\tau)} d(x, p_j) - \min_{i \in N^+(\tau)} d(x, p_i).$$
Let $P^m := P^m_{XY}$ be a probability measure over $(\mathcal{X} \times \mathcal{Y})^m$. For any $0 < \delta < 1$, with $P^m$-probability at least $1 - \delta$ the following holds for an i.i.d. sample $\{(X_i, Y_i)\}_{i=1}^m \in (\mathcal{X} \times \mathcal{Y})^m$ drawn according to $P^m$: for any set of $n$ labeled prototypes $(R, \tau) \in (\mathcal{X} \times \mathcal{Y})^n$, where $(R, \tau)$ may depend on the sample (and, in particular, may be a subset of it), and for all $\gamma > 0$,
$$P(Y f_{R,\tau}(X) \le 0) \le \frac{1}{m}\sum_{j=1}^m I\{Y_j f_{R,\tau}(X_j) \le \gamma\} + \epsilon(m, \gamma, \delta), \qquad (5.7)$$
where, for $\gamma \in \Gamma_{l+1}$, $0 \le l \le L$,
$$\epsilon(m, \gamma, \delta) := \sqrt{\frac{32}{m}\left( \frac{n^2 w_l}{4}\ln\left( \frac{2em}{w_l} \right) + n\ln 2 + \frac{1}{2}\ln\frac{2n}{\pi} + \ln\frac{8(C+1)}{\delta\gamma} \right)} \qquad (5.8)$$
and $w_l := \log_2(\psi(t_l) + 1)$.

The size $N$ of the distance space does not enter the bound of Theorem 6 explicitly, but because the values of $w_l$ and $C$ depend on the distance space through the positive entries of the matrix $F$, these quantities may grow with $N$ (depending on the distance space $\mathcal{X}$). The value of $w_l$ decreases as $l := l(\gamma)$ decreases (because the intervals $\Gamma_l$ are situated more to the right as $l$ decreases). Thus $w_l$ decreases as a step function over the intervals $\Gamma_l$ as $\gamma$ increases: the larger the value of $\gamma$, the lower the interval index $l$ and the lower the value of $w_l$. Thus the upper bound (5.8) decreases as $\gamma$ increases (assuming that the change in $w_l$ dominates the change in the $\ln$ term). To get a feel for the rate of decrease of $w_l$ with respect to $\gamma$, see the example on p. 23 of [10]. Since $\gamma$ is a parameter that may be chosen after the random sample is drawn, it can depend on the sample, which makes the bound data-dependent.
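Putting the pieces together, the following sketch (our own illustration, under the assumption that the positive set a_1 > ... > a_L and the multiplicities m_1, ..., m_L have already been extracted from F as in Section 4.3, here supplied as hypothetical arrays) evaluates the shell rounding, psi, the t_l, w_l, C and a bound of the form (5.8) as reconstructed above.

```python
import numpy as np
from math import comb, log, log2, sqrt, e, pi

def shells(N):
    c = [comb(N, j) for j in range(N + 1)]
    b = list(np.cumsum([0] + c[1:]))                                   # b_0, ..., b_N
    ell = list(np.cumsum([0] + [j * c[j] for j in range(1, N + 1)]))   # ell_0, ..., ell_N
    return b, ell

def ceil_shell(m, N):
    """The rounding m -> 'ceil(m)' described before (5.4)."""
    if m == 0:
        return 0
    b, ell = shells(N)
    Qm = next(q for q in range(1, N + 1) if ell[q] >= m)
    rho = m - ell[Qm - 1]
    return m if (Qm == 1 or rho % Qm == 0) else m + (Qm - rho % Qm)

def psi(m, N):
    """psi(m) of (5.4): b_{Q(m)-1} + (ceil(m) - ell_{Q(m)-1}) / Q(m), with psi(0) = 0."""
    if m == 0:
        return 0.0
    b, ell = shells(N)
    Qm = next(q for q in range(1, N + 1) if ell[q] >= m)
    return b[Qm - 1] + (ceil_shell(m, N) - ell[Qm - 1]) / Qm

def epsilon_bound(a, mult, N, n, m, gamma, delta):
    """Evaluate the margin-dependent quantity of Theorem 6 (a sketch of (5.8))."""
    L = len(a)
    C = float(np.sum(a))                              # C = sum_l a_l
    t = [0]                                           # t_0 = 0, t_l = ceil(t_{l-1} + m_l) as in (5.5)
    for l in range(1, L + 1):
        t.append(ceil_shell(t[-1] + int(mult[l - 1]), N))
    a_ext = [1.0] + list(a) + [0.0]                   # a_0 = 1, a_1, ..., a_L, a_{L+1} = 0
    l = next(k for k in range(L + 1) if a_ext[k + 1] < gamma <= a_ext[k])   # gamma in Gamma_{l+1}
    w_l = log2(psi(t[l], N) + 1)
    complexity = 0.0 if w_l == 0 else (n ** 2) * w_l / 4.0 * log(2 * e * m / w_l)
    G = complexity + n * log(2) + 0.5 * log(2 * n / pi)
    return sqrt((32.0 / m) * (G + log(8 * (C + 1) / (delta * gamma))))

# Example with a hypothetical positive set and multiplicities for a 4-point space.
a = np.array([0.7, 0.5, 0.3, 0.1])    # a_1 > a_2 > ... > a_L
mult = np.array([2, 6, 8, 4])         # multiplicities m_1, ..., m_L
print(epsilon_bound(a, mult, N=4, n=2, m=1000, gamma=0.4, delta=0.05))
```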

5.1 Comparison with other results

Theorem 6 states an upper bound on the generalization error for learning on any finite distance space which decreases with respect to $m$ like $O\left(\sqrt{\frac{n^2}{m}\ln m}\right)$. Let us compare this rate to other works. If the distance space is $\mathcal{X} = \mathbb{R}^d$ and if the prototype set $R$ is a subset of the sample, then Theorem 19.6 of [17] gives the following error bound for learning nearest-neighbor classifiers that are based on $R$:
$$O\left( \sqrt{\frac{1}{m}\left( \frac{(d+1)\,n(n-1)}{2}\ln m + n + \ln\frac{1}{\delta} \right)} \right),$$
which is also $O\left(\sqrt{\frac{n^2}{m}\ln m}\right)$ but, in contrast to Theorem 6, is not data-dependent, which in general makes it looser.

In [15], Theorem 1 presents an error bound for learning LVQ on $\mathbb{R}^d$ which has a dependence on the sample margin of the following form: $O\left(\sqrt{\frac{a n^2}{m}}\right)$, where $a = \min\{d+1,\ B^2/\gamma^2\}$, $B$ bounds the magnitude of each sample point, and $0 < \gamma < 1$ is the sample margin (which is defined similarly to the width (2.4)). They claim that this bound is independent of the dimension $d$, presumably because if $B^2/\gamma^2$ is smaller than $d+1$ then the $d+1$ factor disappears from the bound. Their proof is not included; however, it appears to be a direct application of fat-shattering error bounds; see, for instance, Theorem 4.18 of [16], which bounds the generalization error of learning linear classifiers and has the same dependence on the margin parameter, namely $O(1/\gamma^2)$. Those bounds rest on bounding the log of the $\gamma$-covering number by the fat-shattering dimension and on a bound on the fat-shattering dimension for linear classifiers of order $B^2/\gamma^2$. In comparison, the bound of Theorem 6 also depends on $\gamma$, but in a non-direct way through $w_l$ (as discussed above, $l = l(\gamma)$). Depending on the distance space's positive set $S_F$, the quantity $w_l$ may decrease even faster than $1/\gamma^2$.

Although we assume that the distance space is finite, of cardinality $N$, the bound (5.8) may or may not grow with $N$; this depends on the matrix $F$. In comparison to the above works, in (5.8) there is an implicit complexity quantity which enters through $w_l$ and is defined as $\psi(t_l)$. This is an upper bound on the pseudo-rank of the perturbed matrix $F$, denoted by $F_{a_l}$ (see [10]), which is the number of distinct columns of the matrix $\mathrm{sgn}(F_{a_l})$ and may depend on $N$ (depending on the definition of the distance space).

6 Conclusions

We use the concept of width to learn a family of classifiers based on the nearest-prototype decision rule over an arbitrary finite distance space (a significantly more general, and perhaps more applicable, setting than that of metric spaces). In this setting a classifier is represented by a set of $n$ prototypes, which can be any labeled points in the distance space, and even a subset of the sample. We obtain an upper bound on the generalization error of any such nearest-prototype classifier. Using $\gamma$ as the width parameter, the error bound depends on the $\gamma$-empirical error and holds uniformly over the family of such classifiers. Ignoring the dependence on $\gamma$, the bound is $O\left(\sqrt{\frac{n^2}{m}\ln m}\right)$. The dependence on $\gamma$ is more subtle because it enters through a complexity quantity $\psi(t_l)$ which bounds from above the pseudo-rank of a $\gamma$-perturbed version of a matrix that represents all half-spaces in the distance space.

Acknowledgements

This work was supported in part by a research grant from the Suntory and Toyota International Centres for Economics and Related Disciplines at the London School of Economics. The authors thank the referees for their comments, which resulted in shorter proofs of Propositions 2 and 4.

References

[1] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[2] M. Anthony and P. L. Bartlett. Function learning from interpolation. Combinatorics, Probability and Computing, 9, 2000.
[3] M. Anthony and J. Ratsaby. Classification based on prototypes with spheres of influence. Information and Computation, 256, 2017.
[4] M. Anthony and J. Ratsaby. Maximal width learning of binary functions. Theoretical Computer Science, 411, 2010.
[5] M. Anthony and J. Ratsaby. A hybrid classifier based on boxes and nearest neighbors. Discrete Applied Mathematics, 172:1-11, 2014.
[6] M. Anthony and J. Ratsaby. Analysis of a multi-category classifier. Discrete Applied Mathematics, 160(16-17), 2012.
[7] M. Anthony and J. Ratsaby. Learning bounds via sample width for classifiers on finite metric spaces. Theoretical Computer Science, 529:2-10, 2014.
[8] M. Anthony and J. Ratsaby. Multi-category classifiers and sample width. Journal of Computer and System Sciences, 82(8), 2016.
[9] M. Anthony and J. Ratsaby. A probabilistic approach to case-based inference. Theoretical Computer Science, 589:61-75, 2015.
[10] M. Anthony and J. Ratsaby. Large-width bounds for learning half-spaces on distance spaces. Submitted, 2017.

[11] A. Belousov and J. Ratsaby. Massively parallel computations of the LZ-complexity of strings. In Proceedings of the 28th IEEE Convention of Electrical and Electronics Engineers in Israel (IEEEI'14), pages 1-5, Eilat, December 2014.
[12] A. Belousov and J. Ratsaby. A parallel distributed processing algorithm for image feature extraction. In Advances in Intelligent Data Analysis XIV, 14th International Symposium, IDA 2015, Saint-Etienne, France, October 22-24, 2015, Proceedings, volume 9385 of Lecture Notes in Computer Science. Springer, 2015.
[13] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4), 1989.
[14] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21-27, January 1967.
[15] K. Crammer, R. Gilad-Bachrach, A. Navot, and N. Tishby. Margin analysis of the LVQ algorithm. In Advances in Neural Information Processing Systems 15 (NIPS 2002, December 9-14, 2002, Vancouver, British Columbia, Canada), 2002.
[16] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
[17] L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.
[18] M. M. Deza and E. Deza. Encyclopedia of Distances. Springer-Verlag, 2009.
[19] R. M. Dudley. Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics 63, Cambridge University Press, Cambridge, UK, 1999.
[20] B. Hammer, M. Strickert, and T. Villmann. On the generalization ability of GRLVQ networks. Neural Processing Letters, 21(2), 2005.
[21] T. Kohonen. Self-Organizing Maps. Springer, 2001.
[22] E. Pekalska and R. P. W. Duin. The Dissimilarity Representation for Pattern Recognition: Foundations and Applications. World Scientific Publishing, Singapore, 2005.
[23] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, 1984.
[24] P. Schneider, M. Biehl, and B. Hammer. Adaptive relevance matrices in learning vector quantization. Neural Computation, 21(12), 2009.
[25] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1998.
[26] A. J. Smola, P. L. Bartlett, B. Scholkopf, and D. Schuurmans. Advances in Large-Margin Classifiers (Neural Information Processing). MIT Press, 2000.
[27] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.

[28] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2), 1971.


Analysis of Multiclass Support Vector Machines Analysis of Multiclass Support Vector Machines Shigeo Abe Graduate School of Science and Technology Kobe University Kobe, Japan abe@eedept.kobe-u.ac.jp Abstract Since support vector machines for pattern

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem

More information

Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;

Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity; CSCI699: Topics in Learning and Game Theory Lecture 2 Lecturer: Ilias Diakonikolas Scribes: Li Han Today we will cover the following 2 topics: 1. Learning infinite hypothesis class via VC-dimension and

More information

Computational Learning Theory

Computational Learning Theory Computational Learning Theory [read Chapter 7] [Suggested exercises: 7.1, 7.2, 7.5, 7.8] Computational learning theory Setting 1: learner poses queries to teacher Setting 2: teacher chooses examples Setting

More information

Lecture 25 of 42. PAC Learning, VC Dimension, and Mistake Bounds

Lecture 25 of 42. PAC Learning, VC Dimension, and Mistake Bounds Lecture 25 of 42 PAC Learning, VC Dimension, and Mistake Bounds Thursday, 15 March 2007 William H. Hsu, KSU http://www.kddresearch.org/courses/spring2007/cis732 Readings: Sections 7.4.17.4.3, 7.5.17.5.3,

More information

Optimal global rates of convergence for interpolation problems with random design

Optimal global rates of convergence for interpolation problems with random design Optimal global rates of convergence for interpolation problems with random design Michael Kohler 1 and Adam Krzyżak 2, 1 Fachbereich Mathematik, Technische Universität Darmstadt, Schlossgartenstr. 7, 64289

More information

Computational Learning Theory

Computational Learning Theory Computational Learning Theory Sinh Hoa Nguyen, Hung Son Nguyen Polish-Japanese Institute of Information Technology Institute of Mathematics, Warsaw University February 14, 2006 inh Hoa Nguyen, Hung Son

More information

A Simple Algorithm for Learning Stable Machines

A Simple Algorithm for Learning Stable Machines A Simple Algorithm for Learning Stable Machines Savina Andonova and Andre Elisseeff and Theodoros Evgeniou and Massimiliano ontil Abstract. We present an algorithm for learning stable machines which is

More information

Rademacher Complexity Bounds for Non-I.I.D. Processes

Rademacher Complexity Bounds for Non-I.I.D. Processes Rademacher Complexity Bounds for Non-I.I.D. Processes Mehryar Mohri Courant Institute of Mathematical ciences and Google Research 5 Mercer treet New York, NY 00 mohri@cims.nyu.edu Afshin Rostamizadeh Department

More information

On A-distance and Relative A-distance

On A-distance and Relative A-distance 1 ADAPTIVE COMMUNICATIONS AND SIGNAL PROCESSING LABORATORY CORNELL UNIVERSITY, ITHACA, NY 14853 On A-distance and Relative A-distance Ting He and Lang Tong Technical Report No. ACSP-TR-08-04-0 August 004

More information

Advanced metric adaptation in Generalized LVQ for classification of mass spectrometry data

Advanced metric adaptation in Generalized LVQ for classification of mass spectrometry data Advanced metric adaptation in Generalized LVQ for classification of mass spectrometry data P. Schneider, M. Biehl Mathematics and Computing Science, University of Groningen, The Netherlands email: {petra,

More information

A Uniform Convergence Bound for the Area Under the ROC Curve

A Uniform Convergence Bound for the Area Under the ROC Curve Proceedings of the th International Workshop on Artificial Intelligence & Statistics, 5 A Uniform Convergence Bound for the Area Under the ROC Curve Shivani Agarwal, Sariel Har-Peled and Dan Roth Department

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Tobias Pohlen Selected Topics in Human Language Technology and Pattern Recognition February 10, 2014 Human Language Technology and Pattern Recognition Lehrstuhl für Informatik 6

More information

Vapnik-Chervonenkis Dimension of Neural Nets

Vapnik-Chervonenkis Dimension of Neural Nets Vapnik-Chervonenkis Dimension of Neural Nets Peter L. Bartlett BIOwulf Technologies and University of California at Berkeley Department of Statistics 367 Evans Hall, CA 94720-3860, USA bartlett@stat.berkeley.edu

More information

VC Dimension Review. The purpose of this document is to review VC dimension and PAC learning for infinite hypothesis spaces.

VC Dimension Review. The purpose of this document is to review VC dimension and PAC learning for infinite hypothesis spaces. VC Dimension Review The purpose of this document is to review VC dimension and PAC learning for infinite hypothesis spaces. Previously, in discussing PAC learning, we were trying to answer questions about

More information

CONSIDER a measurable space and a probability

CONSIDER a measurable space and a probability 1682 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL 46, NO 11, NOVEMBER 2001 Learning With Prior Information M C Campi and M Vidyasagar, Fellow, IEEE Abstract In this paper, a new notion of learnability is

More information

A Pseudo-Boolean Set Covering Machine

A Pseudo-Boolean Set Covering Machine A Pseudo-Boolean Set Covering Machine Pascal Germain, Sébastien Giguère, Jean-Francis Roy, Brice Zirakiza, François Laviolette, and Claude-Guy Quimper Département d informatique et de génie logiciel, Université

More information

On Learning µ-perceptron Networks with Binary Weights

On Learning µ-perceptron Networks with Binary Weights On Learning µ-perceptron Networks with Binary Weights Mostefa Golea Ottawa-Carleton Institute for Physics University of Ottawa Ottawa, Ont., Canada K1N 6N5 050287@acadvm1.uottawa.ca Mario Marchand Ottawa-Carleton

More information

Generalization Bounds for the Area Under an ROC Curve

Generalization Bounds for the Area Under an ROC Curve Generalization Bounds for the Area Under an ROC Curve Shivani Agarwal, Thore Graepel, Ralf Herbrich, Sariel Har-Peled and Dan Roth Department of Computer Science University of Illinois at Urbana-Champaign

More information

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest

More information

Models of Language Acquisition: Part II

Models of Language Acquisition: Part II Models of Language Acquisition: Part II Matilde Marcolli CS101: Mathematical and Computational Linguistics Winter 2015 Probably Approximately Correct Model of Language Learning General setting of Statistical

More information

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011)

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011) E0 370 Statistical Learning Theory Lecture 5 Aug 5, 0 Covering Nubers, Pseudo-Diension, and Fat-Shattering Diension Lecturer: Shivani Agarwal Scribe: Shivani Agarwal Introduction So far we have seen how

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Computational and Statistical Learning Theory Problem set 1 Due: Monday, October 10th Please send your solutions to learning-submissions@ttic.edu Notation: Input space: X Label space: Y = {±1} Sample:

More information

Generalization, Overfitting, and Model Selection

Generalization, Overfitting, and Model Selection Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification Maria-Florina (Nina) Balcan 10/03/2016 Two Core Aspects of Machine Learning Algorithm Design. How

More information

Web-Mining Agents Computational Learning Theory

Web-Mining Agents Computational Learning Theory Web-Mining Agents Computational Learning Theory Prof. Dr. Ralf Möller Dr. Özgür Özcep Universität zu Lübeck Institut für Informationssysteme Tanya Braun (Exercise Lab) Computational Learning Theory (Adapted)

More information

Foundations of Machine Learning

Foundations of Machine Learning Introduction to ML Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu page 1 Logistics Prerequisites: basics in linear algebra, probability, and analysis of algorithms. Workload: about

More information

CS446: Machine Learning Spring Problem Set 4

CS446: Machine Learning Spring Problem Set 4 CS446: Machine Learning Spring 2017 Problem Set 4 Handed Out: February 27 th, 2017 Due: March 11 th, 2017 Feel free to talk to other members of the class in doing the homework. I am more concerned that

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

Uniform concentration inequalities, martingales, Rademacher complexity and symmetrization

Uniform concentration inequalities, martingales, Rademacher complexity and symmetrization Uniform concentration inequalities, martingales, Rademacher complexity and symmetrization John Duchi Outline I Motivation 1 Uniform laws of large numbers 2 Loss minimization and data dependence II Uniform

More information

CS340 Machine learning Lecture 4 Learning theory. Some slides are borrowed from Sebastian Thrun and Stuart Russell

CS340 Machine learning Lecture 4 Learning theory. Some slides are borrowed from Sebastian Thrun and Stuart Russell CS340 Machine learning Lecture 4 Learning theory Some slides are borrowed from Sebastian Thrun and Stuart Russell Announcement What: Workshop on applying for NSERC scholarships and for entry to graduate

More information

SVM TRADE-OFF BETWEEN MAXIMIZE THE MARGIN AND MINIMIZE THE VARIABLES USED FOR REGRESSION

SVM TRADE-OFF BETWEEN MAXIMIZE THE MARGIN AND MINIMIZE THE VARIABLES USED FOR REGRESSION International Journal of Pure and Applied Mathematics Volume 87 No. 6 2013, 741-750 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu doi: http://dx.doi.org/10.12732/ijpam.v87i6.2

More information

A Bound on the Label Complexity of Agnostic Active Learning

A Bound on the Label Complexity of Agnostic Active Learning A Bound on the Label Complexity of Agnostic Active Learning Steve Hanneke March 2007 CMU-ML-07-103 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Machine Learning Department,

More information

On Locating-Dominating Codes in Binary Hamming Spaces

On Locating-Dominating Codes in Binary Hamming Spaces Discrete Mathematics and Theoretical Computer Science 6, 2004, 265 282 On Locating-Dominating Codes in Binary Hamming Spaces Iiro Honkala and Tero Laihonen and Sanna Ranto Department of Mathematics and

More information

MACHINE LEARNING. Vapnik-Chervonenkis (VC) Dimension. Alessandro Moschitti

MACHINE LEARNING. Vapnik-Chervonenkis (VC) Dimension. Alessandro Moschitti MACHINE LEARNING Vapnik-Chervonenkis (VC) Dimension Alessandro Moschitti Department of Information Engineering and Computer Science University of Trento Email: moschitti@disi.unitn.it Computational Learning

More information

Hence C has VC-dimension d iff. π C (d) = 2 d. (4) C has VC-density l if there exists K R such that, for all

Hence C has VC-dimension d iff. π C (d) = 2 d. (4) C has VC-density l if there exists K R such that, for all 1. Computational Learning Abstract. I discuss the basics of computational learning theory, including concept classes, VC-dimension, and VC-density. Next, I touch on the VC-Theorem, ɛ-nets, and the (p,

More information

Relating Data Compression and Learnability

Relating Data Compression and Learnability Relating Data Compression and Learnability Nick Littlestone, Manfred K. Warmuth Department of Computer and Information Sciences University of California at Santa Cruz June 10, 1986 Abstract We explore

More information

Online Learning, Mistake Bounds, Perceptron Algorithm

Online Learning, Mistake Bounds, Perceptron Algorithm Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which

More information

List coloring hypergraphs

List coloring hypergraphs List coloring hypergraphs Penny Haxell Jacques Verstraete Department of Combinatorics and Optimization University of Waterloo Waterloo, Ontario, Canada pehaxell@uwaterloo.ca Department of Mathematics University

More information

Generalization bounds

Generalization bounds Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

Bennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence

Bennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence Bennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence Chao Zhang The Biodesign Institute Arizona State University Tempe, AZ 8587, USA Abstract In this paper, we present

More information

GI01/M055: Supervised Learning

GI01/M055: Supervised Learning GI01/M055: Supervised Learning 1. Introduction to Supervised Learning October 5, 2009 John Shawe-Taylor 1 Course information 1. When: Mondays, 14:00 17:00 Where: Room 1.20, Engineering Building, Malet

More information