Large width nearest prototype classification on general distance spaces
Martin Anthony (1) and Joel Ratsaby (2)

(1) Department of Mathematics, London School of Economics and Political Science, Houghton Street, London WC2A 2AE, U.K.
(2) Department of Electrical and Electronics Engineering, Ariel University of Samaria, Ariel 40700, Israel

Abstract

In this paper we consider the problem of learning nearest-prototype classifiers in any finite distance space, that is, in any finite set equipped with a distance function. An important advantage of a distance space over a metric space is that the triangle inequality need not be satisfied, which makes our results potentially very useful in practice. We consider a family of binary classifiers for learning nearest-prototype classification on distance spaces, building on the concept of large-width learning which we introduced and studied in earlier works. Nearest-prototype classification is a more general version of the ubiquitous nearest-neighbor classifier: a prototype may or may not be a sample point. One advantage of the approach taken in this paper is that the error bounds depend on a width parameter, which can be sample-dependent and thereby yield a tighter bound.

1 Introduction

Learning Vector Quantization (LVQ) and its various extensions, introduced by Kohonen [21], are used successfully in many machine learning tools and applications. Learning pattern classification by LVQ is based on adapting a fixed set of labeled prototypes in Euclidean space and using the resulting set of prototypes in a nearest-prototype (winner-take-all) rule to classify any point in the input space. As [20] mentions, LVQ fails if a Euclidean representation is not well-suited for the data; there have been extensions of LVQ that try to allow different metrics [20, 24] and to take advantage of samples for which a more confident (or large-margin) classification can be obtained.
Generalization error bounds with dependence on this sample margin are stated in [20, 24] for learning over Euclidean spaces and, as is usually the case for large-margin learning [1], the error bounds are tighter than ones with no sample-margin dependence. The results of such work are important as they explain why LVQ works well in practice in Euclidean metric spaces.
There are learning domains in which it is difficult to formalize quantitative features that are encoded as numerical variables which together constitute a vector space (usually Euclidean) to discriminate between objects that belong to different classes [22]. Learning over such domains requires a qualitative approach which tries to describe the structure of objects in a way that is similar to how humans do, for instance, in terms of morphological elements of objects. Objects are then represented not by numerical vectors but by other means, such as strings of symbols, which can be compared using a dissimilarity (or distance) function. This approach is much more flexible than one based on numerical features since there are many existing distance functions [18] and new ones can be defined easily for any kind of objects, for instance bioinformatic sequences, graphs, images, etc., and they do not have to satisfy the requirements of a metric. However, most learning algorithms, in particular neural networks, which have been very successful recently, require a Euclidean space or, more generally, a vector space represented by numerical features. Such problem domains are potential applications of prototype-based learning over non-Euclidean or, more generally, non-metric spaces. In [3] we studied learning binary classification with nearest-prototype classifiers over metric spaces and obtained sample-dependent error bounds. In the current paper we consider learning binary classification on finite distance spaces, that is, finite sets equipped with a distance function (often called a dissimilarity measure [18]), where the classifiers (which we call nearest-prototype classifiers) are generalizations of the well-known nearest-neighbor classifier [14]. An important advantage of a distance space over a metric space is that the triangle inequality need not be satisfied, which makes our results potentially very useful in practice.
Our definition of a distance function is quite loose in that it does not need to satisfy any of the non-negativity, symmetry or reflexivity properties of a proper distance function [18]. We still call it a distance because, as far as we can expect in applying our learning results, any useful space has at least the non-negativity property, and so we will assume in the paper that the distance function satisfies the non-negativity property. We consider a family of binary classifiers for learning nearest-prototype classification on distance spaces, building on the concept of large-width learning which was introduced in [4] and expanded to various classification settings [5-10]. The advantage of this approach is that the error bounds depend on a width parameter which can be sample-dependent and thereby yield a tighter error bound. We define a width function which measures the difference between the distance from a test point x (to be classified) to its nearest negative prototype and the distance to its nearest positive prototype. The classifier's decision is defined as the sign of this difference. The set of prototypes from which these two nearest ones are obtained is very general in that it can be any set of points in the distance space. In particular, it can be a subset of the sample and can be determined via any algorithm. The error bounds that we state in the current paper apply regardless of the algorithm that is used to determine these prototypes. The fact that we deal with a distance space, rather than a metric space, means that the triangle inequality need not be satisfied. The error bound depends on quantities that can be evaluated directly from a matrix which consists of the half-space functions over the distance space. This matrix is a function of the distance matrix of the space and hence it can be computed efficiently using massively parallel processing techniques, for instance, as in [11, 12].
Also, as mentioned above, our use of the concept of distance is loose, so that, for instance, not only need the triangle inequality not be satisfied, but none of the three standard properties of a distance need be satisfied either.

2 Setup

2.1 Nearest-prototype classifiers

For a positive integer $n$, let $[n] := \{1,2,\dots,n\}$. We consider a finite set $\mathcal{X} := \{x_1,\dots,x_N\}$ with a binary set $\mathcal{Y} = \{-1,1\}$ of possible classifications. Let $d$, a distance function, be a function from $\mathcal{X}\times\mathcal{X}$ to $\mathbb{R}$. Let us assume that $d$ is normalized such that
$$\mathrm{diam}(\mathcal{X}) := \max_{1\le i,j\le N} d(x_i,x_j) = 1.$$
A prototype $p_i \in \mathcal{X}$ is a point in the distance space that has an associated label denoted by $\tau_i \in \mathcal{Y}$. We write $p_i^+$, $p_j^-$ for a prototype with label $\tau_i = 1$ and a prototype with label $\tau_j = -1$, respectively. When the label of a prototype is not explicitly mentioned we write $p_i$. Given a fixed integer $n \ge 2$, we consider learning the family of classifiers that are defined by the nearest-neighbor rule based on a set of $n$ prototypes. We refer to any such classifier by $h_{R,\tau}$, and it is defined by an ordered set of prototypes $R = \{p_i\}_{i=1}^n$ and their corresponding label vector $\tau := [\tau_1,\dots,\tau_n]$, $\tau_i \in \mathcal{Y}$, $1\le i\le n$. The order of the set $R$ can be defined based on any ordering of $\mathcal{X}$ and it allows us to fix $\tau$ in advance and consider all classifiers obtained by all ordered sets $R$. We define $h_{R,\tau}$ to be a nearest-prototype classifier as follows: let $N^+ = N^+(\tau) := \{i\in[n] : \tau_i = 1\}$ and $N^- = N^-(\tau) := \{i\in[n] : \tau_i = -1\}$. Then, given $x$,
$$h_{R,\tau}(x) = \begin{cases} -1 & \text{if } \operatorname{argmin}_{1\le i\le n} d(x,p_i) \in N^- \\ 1 & \text{otherwise.}\end{cases} \quad (2.1)$$
Let us denote by $\sigma$ a sample of $m$ labeled examples,
$$\sigma := \{(X_i,Y_i)\}_{i=1}^m. \quad (2.2)$$
In the current paper, a prototype $p_i$ may be any point in $\mathcal{X}$; in particular, one that depends on the sample directly or via some learning algorithm. For instance, if $(R,\tau) = \sigma$, then $h_{R,\tau}$ is the well-known nearest-neighbor classifier [14]. If the labeled prototype set is only a subset of the sample, $(R,\tau)\subset\sigma$, then $h_{R,\tau}$ belongs to the family of so-called edited nearest-neighbor classifiers [17].
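As a minimal illustration of the nearest-prototype rule (2.1), the following sketch implements it directly from a distance function; all names here (`nearest_prototype_classify`, the toy distance) are ours, not the paper's, and the distance need not satisfy the triangle inequality.

```python
def nearest_prototype_classify(x, prototypes, labels, d):
    """Nearest-prototype rule (2.1): return -1 if the nearest prototype
    carries label -1, else +1.  labels is the vector tau of +1/-1 values;
    d is any (non-negative) dissimilarity -- no triangle inequality needed."""
    i_star = min(range(len(prototypes)), key=lambda i: d(x, prototypes[i]))
    return -1 if labels[i_star] == -1 else 1

# Toy distance space: points on a line with d = absolute difference.
d = lambda a, b: abs(a - b)
prototypes = [0.0, 1.0]
labels = [-1, 1]
print(nearest_prototype_classify(0.2, prototypes, labels, d))  # -> -1
print(nearest_prototype_classify(0.9, prototypes, labels, d))  # -> 1
```

With $n = 2$ oppositely labeled prototypes this is exactly the two-prototype classifier discussed later in Section 2.3.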
The prototype set $(R,\tau)$ may not necessarily be a subset of the sample, but could be derived from it via some adaptive procedure such as the LVQ algorithm [21], in which case $h_{R,\tau}$ would be the LVQ classifier. In the current paper, any such classifier is referred to as a nearest-prototype classifier and, as mentioned above, is denoted by $h_{R,\tau}$.
2.2 Probabilistic modeling of learning

We work in the framework of the popular PAC model of computational learning theory (see [13, 27]). This model assumes that the labeled examples $(X_i,Y_i)$ in the training sample have been generated randomly according to some fixed (but unknown) probability distribution $P$ on $Z = \mathcal{X}\times\mathcal{Y}$. (This includes, as a special case, the situation in which each $X_i$ is drawn according to a fixed distribution on $\mathcal{X}$ and is then labeled deterministically by $Y_i = t(X_i)$, where $t$ is some fixed function.) Thus, a sample (2.2) of length $m$ can be regarded as being drawn randomly according to the product probability distribution $P^m$. In general, suppose that $H$ is a set of functions from $\mathcal{X}$ to $\{-1,1\}$. An appropriate measure of how well $h\in H$ would perform on further randomly drawn points is its error, $\mathrm{er}_P(h)$, the probability that $h(X)\ne Y$ for random $(X,Y)$, which can be expressed as
$$\mathrm{er}_P(h) = P(h(X)\ne Y) = P(Yh(X) < 0). \quad (2.3)$$
Given any function $h\in H$, we can measure how well $h$ matches the training sample through its sample error
$$\mathrm{er}_\sigma(h) = \frac{1}{m}\left|\{i : h(X_i)\ne Y_i\}\right|$$
(the proportion of points in the sample incorrectly classified by $h$). Much classical work in learning theory (see [13, 27], for instance) related the error of a classifier $h$ to its sample error. A typical result would state that, for all $\delta\in(0,1)$, with probability at least $1-\delta$, for all $h\in H$ we have $\mathrm{er}_P(h) < \mathrm{er}_\sigma(h) + \epsilon(m,\delta)$, where $\epsilon(m,\delta)$ (known as a generalization error bound) is decreasing in $m$ and $\delta$. Such results can be derived using uniform convergence theorems from probability theory [19, 23, 28], in which case $\epsilon(m,\delta)$ would typically involve a quantity known as the growth function of the set of classifiers [1, 13, 27, 28]. More recently, emphasis has been placed on learning with a large margin (see, for instance, [1, 2, 25, 26]).
The rationale behind margin-based, or width-based, generalization error bounds is that if a classifier has managed to achieve a wide separation between the points of different classifications, then this indicates that it is a good classifier, and it is possible that a better generalization error bound can be obtained. Margin-based results apply when the classifiers are derived from real-valued functions by thresholding (taking their sign). A more direct approach, which does not require real-valued functions as a basis for a classification margin, uses the concept of width, introduced in [4] and studied in various settings in [5-10].

2.3 Width and error of a classifier

We define the width of $h_{R,\tau}$ at a point $x\in\mathcal{X}$ as follows:
$$w_{h_{R,\tau}}(x) := \min_{1\le i\le n:\ \tau_i\ne h(x)} d(x,p_i) \;-\; \min_{1\le i\le n:\ \tau_i = h(x)} d(x,p_i). \quad (2.4)$$
In words, the width of $h$ at $x$ is the difference between the distance to the nearest unlike prototype of $x$ and the distance to the nearest prototype to $x$, where "unlike" means of a different sign than the classification of $x$. In [10] we consider binary classifiers that are based on a pair of oppositely labeled prototypes and use this definition of width, which, in that case, becomes simply $d(x,p^-) - d(x,p^+)$.
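The width (2.4) can be sketched directly from its definition; this is an illustrative helper of ours (not the paper's code), reusing the toy one-dimensional distance from above.

```python
def width(x, prototypes, labels, d, h_x):
    """Width (2.4): distance to the nearest unlike-labeled prototype minus
    distance to the nearest like-labeled prototype, relative to h(x) = h_x."""
    unlike = min(d(x, p) for p, t in zip(prototypes, labels) if t != h_x)
    like = min(d(x, p) for p, t in zip(prototypes, labels) if t == h_x)
    return unlike - like

d = lambda a, b: abs(a - b)
# Two prototypes p^- = 0.0 and p^+ = 1.0; at x = 0.2 the rule gives h(x) = -1,
# and the width reduces to d(x, p^+) - d(x, p^-) = 0.8 - 0.2 = 0.6.
print(round(width(0.2, [0.0, 1.0], [-1, 1], d, -1), 6))  # -> 0.6
```

For a pair of oppositely labeled prototypes this reproduces the two-prototype formula $d(x,p^-) - d(x,p^+)$ up to the sign $h(x)$.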
The signed width (or margin) function corresponding to (2.4) is defined as
$$f_{R,\tau}(x) := f_{h_{R,\tau}}(x) = h_{R,\tau}(x)\, w_{h_{R,\tau}}(x). \quad (2.5)$$
Note that this definition means that for $x$ equidistant from two oppositely labeled prototypes $p,q\in R$ that are each the closest to $x$ among all other prototypes of the same label, the value of the margin $f_{R,\tau}(x)$ at this $x$ is zero. This definition is intuitive and actually makes the analysis simpler compared to an alternative definition of width [7]. This definition of width is an application of a more general definition of width, introduced in [5, 6], which takes the form $f(x) = d(x,S^-) - d(x,S^+)$, where $S^-$ and $S^+$ are any disjoint subsets of the input space that are labeled $-1$ and $1$, respectively. (In [7-9], a slightly different definition of width was used, where the union of the disjoint sets $S^-$ and $S^+$ equals the input space.) In [3] we considered binary classifiers which are also based on prototypes, where the decision is not based on the nearest prototype but on the combined influence of several prototypes based on certain regions of influence. The present notion of width was not explicitly utilized there.

For a positive margin parameter $\gamma > 0$ and a training sample $\sigma$, the empirical (sample) $\gamma$-margin error is defined as
$$\hat{P}_m(Yf_{R,\tau}(X) \le \gamma) = \frac{1}{m}\sum_{j=1}^m I\left(Y_j f_{R,\tau}(X_j) \le \gamma\right).$$
(Here, $I(A)$ is the indicator function of the set, or event, $A$.) Define the function
$$\mathrm{sgn}(a) := \begin{cases} 1 & \text{if } a > 0 \\ -1 & \text{if } a \le 0. \end{cases}$$
For the purpose of bounding the generalization error it is convenient to express the classification $h(x)$ in terms of the signed width as follows: $h_{R,\tau}(X) = \mathrm{sgn}(f_{R,\tau}(X))$. Therefore the generalization error $\mathrm{er}_P(h_{R,\tau})$ can be bounded as follows:
$$\mathrm{er}_P(h_{R,\tau}) = P(h_{R,\tau}(X)\ne Y) \quad (2.6)$$
$$= P(Yf_{R,\tau}(X) < 0) + P(Y = 1,\ f_{R,\tau}(X) = 0) \;\le\; P(Yf_{R,\tau}(X)\le 0). \quad (2.7)$$
Our aim is to show that the upper bound (2.7) on the generalization misclassification error is not much greater than $\hat{P}_m(Yf_{R,\tau}(X)\le\gamma)$.
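The empirical $\gamma$-margin error is a simple counting quantity; a minimal sketch (our names, our toy signed width on the line) is:

```python
def empirical_margin_error(sample, f, gamma):
    """hat{P}_m(Y f(X) <= gamma): fraction of examples whose signed-width
    margin y * f(x) is at most gamma."""
    return sum(1 for x, y in sample if y * f(x) <= gamma) / len(sample)

# Toy signed width with prototypes p^- = 0.0 and p^+ = 1.0, i.e. the
# two-prototype case f(x) = d(x, p^-) - d(x, p^+) of (2.5).
f = lambda x: abs(x - 0.0) - abs(x - 1.0)
sample = [(0.1, -1), (0.2, -1), (0.9, 1), (0.6, 1)]
# margins y*f(x): 0.8, 0.6, 0.8, 0.2 -- exactly one is <= 0.5
print(empirical_margin_error(sample, f, 0.5))  # -> 0.25
```

Choosing a larger $\gamma$ can only count more sample points, which is why the bounds below trade a larger empirical term against a smaller deviation term.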
Explicitly, we aim for bounds of the form: for all $\delta\in(0,1)$, with probability at least $1-\delta$, for all $\gamma\in(0,\mathrm{diam}(\mathcal{X})]$, we have
$$\mathrm{er}_P(h_{R,\tau}) \le \hat{P}_m(Yf_{R,\tau}(X)\le\gamma) + \epsilon(m,\gamma,\delta).$$
This will imply that if the learner finds a hypothesis which, for a large value of $\gamma$, has a small $\gamma$-margin error, then that hypothesis is likely to have small error.

The advantage of working with the notion of width is that it is possible to have such a uniform bound over a very large family of classifiers. For instance, in [7] we obtained such bounds for learning the family of all possible binary classifiers on any finite metric space, and in [8] we did the same for multi-category classifiers over infinite metric spaces. As mentioned above, in the current work we consider particular kinds of classifiers $h_{R,\tau}$ that are defined by the nearest-prototype rule based on a fixed number $n$ of prototypes. Thus we expect that the bound that we obtain is tighter than the one in [7], which holds for the family of all binary classifiers. To obtain a uniform bound, we are interested in showing that the probability of the "bad" event (namely, that there exist some value of $\gamma$ and some classifier $h_{R,\tau}$ such that the generalization error is not bounded from above by some small deviation from the empirical $\gamma$-margin error) is small. That is, we aim to bound the following failure probability:
$$P^m_{X,Y}\left(\left\{\sigma : \exists\gamma,\ \exists\tau,\ \exists R,\ P(Yf_{R,\tau}(X)\le 0) > \frac{1}{m}\sum_{j=1}^m I\{Y_j f_{R,\tau}(X_j)\le\gamma\} + \epsilon\right\}\right). \quad (2.8)$$
This can be expressed as follows:
$$P^m_{X,Y}\left(\left\{\sigma : \exists\gamma,\ \exists\tau,\ \exists R,\ P(Yf_{R,\tau}(X) > 0) < \frac{1}{m}\sum_{j=1}^m I\{Y_j f_{R,\tau}(X_j) > \gamma\} - \epsilon\right\}\right). \quad (2.9)$$
Let us fix $\gamma$ for now, and deal with bounding the probability
$$P^m_{X,Y}\left(\left\{\sigma : \exists\tau,\ \exists R,\ P(Yf_{R,\tau}(X) > 0) < \frac{1}{m}\sum_{j=1}^m I\{y_j f_{R,\tau}(x_j) > \gamma\} - \epsilon\right\}\right). \quad (2.10)$$

3 Towards bounding the probability

3.1 Representing the bad event by related sets

Define the set $M_{R,\tau,\gamma} \subseteq \mathcal{X}\times\mathcal{Y}$ as follows:
$$M_{R,\tau,\gamma} := \{(x,y) : yf_{R,\tau}(x) > \gamma\}$$
and let
$$M^+_{R,\tau,\gamma} := \{(x,1) : f_{R,\tau}(x) > \gamma\} \quad (3.1)$$
$$M^-_{R,\tau,\gamma} := \{(x,-1) : f_{R,\tau}(x) < -\gamma\}. \quad (3.2)$$
Note that
$$M_{R,\tau,\gamma} = \left(M_{R,\tau,\gamma}\cap\{(x,y): y=1\}\right) \cup \left(M_{R,\tau,\gamma}\cap\{(x,y): y=-1\}\right) = M^+_{R,\tau,\gamma}\cup M^-_{R,\tau,\gamma}. \quad (3.3)$$
We can write
$$P(Yf_{R,\tau}(X) > \gamma) = P(M_{R,\tau,\gamma})$$
and
$$\frac{1}{m}\sum_{j=1}^m I\{Y_j f_{R,\tau}(X_j) > \gamma\} = P_m(M_{R,\tau,\gamma}),$$
where $P_m$ denotes the empirical measure based on a sample of length $m$. Thus (2.9) is expressed as
$$P^m_{X,Y}\left(\{\sigma : \exists\gamma,\ \exists\tau,\ \exists R,\ P(M_{R,\tau,0}) < P_m(M_{R,\tau,\gamma}) - \epsilon\}\right).$$
Let $\epsilon(\gamma)$ be any function depending on $\gamma$, in a way to be specified later, and define the set $E(\gamma)\subseteq(\mathcal{X}\times\mathcal{Y})^m$ as
$$E(\gamma) := \{\sigma : \exists\tau,\ \exists R,\ P(M_{R,\tau,0}) < P_m(M_{R,\tau,\gamma}) - \epsilon(\gamma)\}. \quad (3.4)$$
Then substituting $\epsilon(\gamma)$ for $\epsilon$ in (2.10) implies that (2.10) equals the probability $P^m_{X,Y}(E(\gamma))$. It follows that (2.8) equals
$$P^m_{X,Y}\left(\bigcup_{\gamma\in(0,\mathrm{diam}(\mathcal{X})]} E(\gamma)\right). \quad (3.5)$$
Let $L$ be an integer (to be specified in a section further below). For integers $0\le l\le L+1$, let $\gamma_l$ be a decreasing sequence such that the following conditions hold: $0\le\gamma_l\le\gamma_0 = 1$ and $\gamma_{L+1} = 0$. Denote by
$$C := \sum_{l=1}^L \gamma_l.$$
While all the above quantities $L$, $\gamma_l$ and $C$ may depend on $\mathcal{X}$, we keep this dependence implicit in the notation. Define $\Gamma_l := (\gamma_l,\gamma_{l-1}]$ for $1\le l\le L+1$. Then (3.5) equals
$$P^m_{X,Y}\left(\bigcup_{l=1}^{L+1}\bigcup_{\gamma\in\Gamma_l} E(\gamma)\right) \le \sum_{l=1}^{L+1} P^m_{X,Y}\left(\bigcup_{\gamma\in\Gamma_l} E(\gamma)\right). \quad (3.6)$$
Define the set $E_l\subseteq(\mathcal{X}\times\mathcal{Y})^m$ as
$$E_l := \{\sigma : \exists\tau,\ \exists R,\ P(M_{R,\tau,\gamma_l}) < P_m(M_{R,\tau,\gamma_l}) - \epsilon(\gamma_{l-1})\}.$$
Henceforth, assume that $\epsilon(\gamma)$ is a non-increasing function over each interval $\Gamma_l$.
Proposition 1. For any $\gamma\in\Gamma_l$, $E(\gamma)\subseteq E_l$.

Proof. We have $M_{R,\tau,0}\supseteq M_{R,\tau,\gamma_l}$; thus $P(M_{R,\tau,0})\ge P(M_{R,\tau,\gamma_l})$. And $M_{R,\tau,\gamma_l}\supseteq M_{R,\tau,\gamma}$ since $\gamma_l\le\gamma$; therefore $P_m(M_{R,\tau,\gamma})\le P_m(M_{R,\tau,\gamma_l})$. For $\gamma\le\gamma_{l-1}$, by the above assumption on $\epsilon$, $\epsilon(\gamma)\ge\epsilon(\gamma_{l-1})$. It follows that $E(\gamma)\subseteq E_l$.

The event that there exist $\tau$ and $R$ such that $P(M_{R,\tau,\gamma_l}) < P_m(M_{R,\tau,\gamma_l}) - \epsilon(\gamma_{l-1})$ holds, together with (3.3), implies that at least one of the following events occurs: there exist $\tau$ and $R$ such that $P(M^+_{R,\tau,\gamma_l}) < P_m(M^+_{R,\tau,\gamma_l}) - \epsilon(\gamma_{l-1})/2$, or there exist $\tau$ and $R$ such that $P(M^-_{R,\tau,\gamma_l}) < P_m(M^-_{R,\tau,\gamma_l}) - \epsilon(\gamma_{l-1})/2$. Let
$$E^+_l := \left\{\sigma : \exists\tau,\ \exists R,\ P(M^+_{R,\tau,\gamma_l}) < P_m(M^+_{R,\tau,\gamma_l}) - \epsilon(\gamma_{l-1})/2\right\} \quad (3.7)$$
and
$$E^-_l := \left\{\sigma : \exists\tau,\ \exists R,\ P(M^-_{R,\tau,\gamma_l}) < P_m(M^-_{R,\tau,\gamma_l}) - \epsilon(\gamma_{l-1})/2\right\}. \quad (3.8)$$
Then
$$P^m_{X,Y}(E_l) \le P^m_{X,Y}(E^+_l) + P^m_{X,Y}(E^-_l).$$

3.2 Bounding the probability in terms of the growth function

We now aim to bound from above the first probability, $P^m_{X,Y}(E^+_l)$. We first briefly recall the definitions of growth function and VC-dimension [28]. Suppose that $\mathcal{C}$ is a collection of subsets of a set $Z$. Let $S$ be any (finite) subset of $Z$. Then a dichotomy of $S$ by $\mathcal{C}$ is a set of the form $S\cap C$ where $C\in\mathcal{C}$. We denote the number of dichotomies of $S$ by $\mathcal{C}$ as $\#(\mathcal{C};S)$. Thus,
$$\#(\mathcal{C};S) = \left|\{S\cap C : C\in\mathcal{C}\}\right|.$$
Then the growth function of $\mathcal{C}$ is the function $\Pi_\mathcal{C}:\mathbb{N}\to\mathbb{N}$ defined as follows: for $m\in\mathbb{N}$,
$$\Pi_\mathcal{C}(m) = \max\{\#(\mathcal{C};S) : S\subseteq Z,\ |S| = m\}.$$
The VC-dimension of $\mathcal{C}$ is (infinity, or) the largest value of $m$ such that $\Pi_\mathcal{C}(m) = 2^m$. (A set $S$ of size $m$ such that $\#(\mathcal{C};S) = 2^m$ is said to be shattered by $\mathcal{C}$.) Define the classes
$$\mathcal{M}^+_l := \left\{M^+_{R,\tau,\gamma_l} : \tau\in\mathcal{Y}^n,\ R\subseteq\mathcal{X}\right\},\qquad \mathcal{M}^-_l := \left\{M^-_{R,\tau,\gamma_l} : \tau\in\mathcal{Y}^n,\ R\subseteq\mathcal{X}\right\}.$$
Denote by $\Pi_{\mathcal{M}^+_l}(m)$ the growth function of the class $\mathcal{M}^+_l$. By [28] (see also Theorem 4.3 of [1]), it follows that
$$P^m_{X,Y}(E^+_l) \le 4\,\Pi_{\mathcal{M}^+_l}(2m)\exp\left(-m\,\epsilon^2(\gamma_{l-1})/32\right)$$
and
$$P^m_{X,Y}(E^-_l) \le 4\,\Pi_{\mathcal{M}^-_l}(2m)\exp\left(-m\,\epsilon^2(\gamma_{l-1})/32\right).$$
Define $G(m,\gamma)$ to be an upper bound on $\ln\Pi_{\mathcal{M}^+_l}(2m)$ and $\ln\Pi_{\mathcal{M}^-_l}(2m)$, to be specified later, such that choosing
$$\epsilon(\gamma) := \sqrt{\frac{32}{m}\left(G(m,\gamma) + \ln\frac{8(C+1)}{\delta\gamma}\right)} \quad (3.9)$$
makes the inequality $\epsilon(\gamma)\ge\epsilon(\gamma_{l-1})$ hold for all $\gamma\in\Gamma_l$, as required for Proposition 1 and for the definition of $\epsilon(\gamma)$ in (3.4). Then, substituting $\gamma_{l-1}$ for $\gamma$ in (3.9) and letting (3.9) be the choice for $\epsilon(\gamma_{l-1})$ in (3.7), it follows from Theorems 3.7 and 4.3 of [1] that both $P(E^+_l)$ and $P(E^-_l)$ are bounded from above by $\delta\gamma_{l-1}/(2(C+1))$. From Proposition 1, it follows that (3.6) is bounded as follows:
$$\sum_{l=1}^{L+1} P^m_{X,Y}\left(\bigcup_{\gamma\in\Gamma_l}E(\gamma)\right) \le \sum_{l=1}^{L+1} P^m_{X,Y}(E_l) \quad (3.10)$$
$$\le \sum_{l=1}^{L+1}\left(P^m_{X,Y}(E^+_l) + P^m_{X,Y}(E^-_l)\right) \quad (3.11)$$
$$\le 2\sum_{l=1}^{L+1}\frac{\delta\gamma_{l-1}}{2(C+1)} = \frac{\delta}{C+1}\sum_{l=1}^{L+1}\gamma_{l-1} = \frac{\delta}{C+1}\sum_{l=0}^{L}\gamma_l = \frac{\delta}{C+1}\left(1+\sum_{l=1}^{L}\gamma_l\right) = \delta. \quad (3.12)$$
In the next section, we derive a value of $G(m,\gamma)$ which bounds from above the logarithm of the growth functions of $\mathcal{M}^+_l$ and $\mathcal{M}^-_l$.

4 Bounding the growth function

In this section we bound the growth functions of the classes $\mathcal{M}^+_l$ and $\mathcal{M}^-_l$.

4.1 Half-spaces of $\mathcal{X}$

Define $\mathrm{ind}(p_i)\in[N]$ to be the index of a point $x$ such that $p_i = x_{\mathrm{ind}(p_i)}$; that is, $\mathrm{ind}(p_i)$ is the index of a prototype with respect to the pre-determined ordering of the distance space $\mathcal{X}$. Clearly, each $p_i$ has a unique $\mathrm{ind}(p_i)$ value.
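On a small finite space the dichotomy count $\#(\mathcal{C};S)$ and the growth function $\Pi_\mathcal{C}(m)$ recalled above can be computed by brute force. The sketch below (our illustrative code, on a toy class of initial segments) is exponential in $m$ and only meant to make the definitions concrete.

```python
from itertools import combinations

def num_dichotomies(cls, S):
    """#(C; S): number of distinct intersections S ∩ C over C in the class."""
    return len({frozenset(S) & frozenset(C) for C in cls})

def growth(cls, Z, m):
    """Growth function Pi_C(m): maximum of #(C; S) over all m-subsets S of Z."""
    return max(num_dichotomies(cls, S) for S in combinations(Z, m))

# Toy class on Z = {0,1,2,3}: all initial segments.  Its VC-dimension is 1.
Z = [0, 1, 2, 3]
cls = [set(range(k)) for k in range(5)]  # {}, {0}, {0,1}, {0,1,2}, {0,1,2,3}
print(growth(cls, Z, 1))  # -> 2 = 2^1: every singleton is shattered
print(growth(cls, Z, 2))  # -> 3 < 2^2: no pair is shattered
```

For initial segments, a pair $\{a,b\}$ with $a<b$ only yields the traces $\emptyset$, $\{a\}$ and $\{a,b\}$, which is why the growth function stops doubling at $m = 2$.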
For any $i,j\in[N]$, define an affine half-space set as follows:
$$W^{(i,j)}_\gamma := \{x : d(x,x_j) - d(x,x_i) > \gamma\}. \quad (4.1)$$
Let the class of such sets be
$$W_\gamma := \left\{W^{(i,j)}_\gamma : 1\le i,j\le N,\ j\ne i\right\}.$$

4.2 Matrix representation

Recall that $\mathcal{X} = \{x_1,\dots,x_N\}$. Since a prototype may be any point in $\mathcal{X}$, in general, for any pair of prototypes $p,q\in\mathcal{X}$ there are some $1\le i\ne j\le N$ such that $p = x_i$, $q = x_j$. Write $W^{(p,q)}_0 := W^{(i,j)}_0$. Then $W^{(i,j)}_0$ corresponds to the positive elements of the following vector:
$$F^{(i)}_j := \begin{bmatrix} d(x_1,x_j) - d(x_1,x_i) \\ \vdots \\ d(x_N,x_j) - d(x_N,x_i) \end{bmatrix}. \quad (4.2)$$
Note that taking the sign of a vector $F^{(i)}_j$ yields a partition of $\mathcal{X}$ into two parts, referred to as half-spaces. For a real vector $v\in\mathbb{R}^N$ define $\mathrm{sgn}(v) = [\mathrm{sgn}(v_1),\dots,\mathrm{sgn}(v_N)]$. Hence $\mathrm{sgn}(F^{(i)}_j)$ corresponds to a half-space on $\mathcal{X}$. Fix any point $x_i\in\mathcal{X}$ and let the $N\times(N-1)$ matrix $F^{(i)}$ be defined by
$$F^{(i)} = \left[F^{(i)}_1,\dots,F^{(i)}_{i-1},F^{(i)}_{i+1},\dots,F^{(i)}_N\right] \quad (4.3)$$
where the $j$-th column is $F^{(i)}_j$, $j\ne i$, $1\le j\le N$. Define the $N\times N(N-1)$ matrix
$$F := \left[F^{(1)},\dots,F^{(N)}\right]. \quad (4.4)$$
The binary matrix
$$\mathrm{sgn}(F) := \left[\mathrm{sgn}(F^{(1)}),\dots,\mathrm{sgn}(F^{(N)})\right],\qquad \mathrm{sgn}(F^{(i)}) := \left[\mathrm{sgn}(F^{(i)}_1),\dots,\mathrm{sgn}(F^{(i)}_N)\right],$$
represents the class of all half-spaces on $\mathcal{X}$.

4.3 Thresholding by $\gamma$

The set $W^{(i,j)}_0$ corresponds to some column of the matrix $F$. We now define a more general matrix whose columns correspond to the sets $W^{(i,j)}_\gamma$ defined in (4.1), for any fixed $\gamma>0$. Bounding the VC-dimension of this matrix means that we obtain a bound on the VC
dimension of the class $W_\gamma$. (By the VC-dimension of a binary matrix, we mean that of the set system in which the indicator functions of the sets correspond to the columns of the matrix, with a 1-entry denoting inclusion in the set.) For any $1\le i\ne j\le N$, a set $W^{(i,j)}_\gamma$ corresponds to the positive elements of the vector $F^{(i)}_j - \gamma\mathbf{1}$, where $\mathbf{1}$ is an $N\times 1$ vector of all ones. Denote by $J$ an $N\times N(N-1)$ matrix of all ones. For any $\gamma>0$, let us consider the $N\times N(N-1)$ matrix
$$F_\gamma := F - \gamma J = \left[F^{(1)}_2 - \gamma\mathbf{1},\dots,F^{(N)}_{N-1} - \gamma\mathbf{1}\right]. \quad (4.5)$$
The matrix $F_\gamma$ corresponds to the class $W_\gamma$ of sets, where for column $F^{(i)}_j - \gamma\mathbf{1}$, the positive elements of the vector correspond to the elements of the set $W^{(i,j)}_\gamma$. The binary matrix $\mathrm{sgn}(F_\gamma)$ corresponds to a class of affine half-spaces on $\mathcal{X}$ (the columns of $\mathrm{sgn}(F_\gamma)$).

We now choose the constant $L$ of Section 3 to be the number of distinct positive entries of $F$, denoted $0 < a_L < a_{L-1} < \dots < a_1 \le 1$, where $a_1$ and $a_L$ are the maximum and minimum positive entries of $F$, respectively. For $1\le i\le L$, define the multiplicity of $a_i$, denoted by $m_i$, as the number of times that $a_i$ appears in $F$. We refer to $S_F = \{a_1,a_2,\dots,a_L\}$ as the positive set of $F$ and we set $a_{L+1} = 0$, $a_0 = 1$. We henceforth choose for $\gamma_l$ (defined in Section 3) the value $\gamma_l := a_l$; thus we have $\Gamma_l := (a_l,a_{l-1}]$, $1\le l\le L$, and $\Gamma_{L+1} := [0,a_L]$. From [10] we have the following bound on the VC-dimension of $\mathrm{sgn}(F_\gamma)$:
$$\mathrm{VC}(\mathrm{sgn}(F_\gamma)) \le w(\gamma), \quad (4.6)$$
where
$$w(\gamma) := w_{l-1},\quad \text{for }\gamma\in\Gamma_l,\ 1\le l\le L+1, \quad (4.7)$$
is a non-increasing step function taking the constant value $w_l := \log_2(\Phi(\eta_l)+1)$ over the interval $\Gamma_{l+1}$, and the $\Phi(\eta_l)$ (defined further below in Section 5) are based on the multiplicity values $m_l$ of the positive entries $a_l$. The value of $\eta_0$ is $0$ and $\Phi(0) = 0$, so $w_0 = 0$. Let $T\subseteq\mathcal{X}$ and denote by $(F_\gamma)_T$ the sub-matrix of $F_\gamma$ restricted to the rows that correspond to the elements of $T$.
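The matrix $F$ of (4.2)-(4.4), its sign pattern, and the positive set $S_F$ with multiplicities can all be read off a distance matrix directly. A minimal sketch (our function names; the toy space uses unnormalized integer distances for exact arithmetic):

```python
from collections import Counter

def half_space_matrix(D):
    """Columns F^{(i)}_j of (4.2)-(4.4): for each ordered pair i != j, the
    k-th entry is d(x_k, x_j) - d(x_k, x_i).  D is the N x N distance matrix.
    Returned as a dict keyed by (i, j) for readability."""
    N = len(D)
    return {(i, j): [D[k][j] - D[k][i] for k in range(N)]
            for i in range(N) for j in range(N) if j != i}

def sgn(v):
    """Componentwise sign with sgn(0) = -1, as defined in Section 2.3."""
    return [1 if a > 0 else -1 for a in v]

def positive_set(F):
    """Distinct positive entries a_1 > ... > a_L of F and their multiplicities m_l."""
    counts = Counter(v for col in F.values() for v in col if v > 0)
    a = sorted(counts, reverse=True)
    return a, [counts[x] for x in a]

# Toy space: three points 0, 1, 3 on the line with d(u, v) = |u - v|
# (skipping the diam = 1 normalization so all entries stay integers).
D = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
F = half_space_matrix(D)
print(len(F))           # -> 6 = N(N-1) columns
print(sgn(F[(0, 1)]))   # -> [1, -1, -1], the half-space for the pair (x_1, x_2)
print(positive_set(F))  # -> ([3, 2, 1], [2, 3, 4])
```

Thresholding by $\gamma$ as in (4.5) just subtracts $\gamma$ from every entry before taking signs, so the positive set and the multiplicities are exactly the data needed for the quantities $w_l$ below.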
Sauer's Lemma (see, for instance, Theorem 3.6 in [1]) implies that the number of distinct columns of $\mathrm{sgn}((F_\gamma)_T)$, denoted by $|\mathrm{sgn}((F_\gamma)_T)|$, is bounded as follows:
$$\left|\mathrm{sgn}((F_\gamma)_T)\right| \le \sum_{i=0}^{w(\gamma)}\binom{|T|}{i} \le \left(\frac{e|T|}{w(\gamma)}\right)^{w(\gamma)}.$$
Since $F_\gamma$ corresponds to the class $W_\gamma$, the number of dichotomies of the class of sets $W_\gamma$ on $T$ is bounded as follows:
$$\#(W_\gamma;T) \le \left(\frac{e|T|}{w(\gamma)}\right)^{w(\gamma)}. \quad (4.8)$$
Because $w(\gamma)$ is a step function over the intervals $\Gamma_l$, the right side of (4.8) is also a step function over these intervals and it suffices to derive its values at the interval boundaries $a_l$. Note that $a_l\in\Gamma_{l+1}$, so for $\gamma = a_l$, $1\le l\le L+1$,
$$\#(W_{a_l};T) \le \left(\frac{e|T|}{w_l}\right)^{w_l}. \quad (4.9)$$
For $l = 0$, since $w_0 = 0$, we have $\#(W_{a_0};T) = 1$.

4.4 Bounding the growth functions of $\mathcal{M}^+_{a_l}$ and $\mathcal{M}^-_{a_l}$

We bound the growth function of the class $\mathcal{M}^+_{a_l}$. We first fix the prototype-label vector $\tau$ and let $R$ run over all possible $n$-prototype sets. We denote by $\mathcal{M}^+_{\tau,a_l}$ the corresponding class of sets $M^+_{R,\tau,a_l}$, $R\subseteq\mathcal{X}$.

Proposition 2. For any fixed $\tau\in\mathcal{Y}^n$ and for any subset $S\subseteq\mathcal{X}\times\{1\}$, the number of dichotomies $\#(\mathcal{M}^+_{\tau,a_l};S)$ obtained by $\mathcal{M}^+_{\tau,a_l}$ on $S$ is bounded as follows:
$$\#(\mathcal{M}^+_{\tau,a_l};S) \le \#(W_{a_l};S^+)^{N^+(\tau)N^-(\tau)},$$
where $S^+ := \{x\in\mathcal{X} : (x,1)\in S\}$.

Proof. The number of dichotomies that the class $\mathcal{M}^+_{\tau,a_l}$ of sets $M^+_{R,\tau,a_l}$ obtains on $S$ is the same as the number of dichotomies that the class $\mathcal{V}^+_{\tau,a_l}$ of sets $V^+_{R,\tau,a_l} := \{x : (x,1)\in M^+_{R,\tau,a_l}\}$ obtains on $S^+$. Hence it suffices to find an upper bound on $\#(\mathcal{V}^+_{\tau,a_l};S^+)$. Fix any set of prototypes $R\subseteq\mathcal{X}$ of cardinality $n$. We have
$$V^+_{R,\tau,a_l} = \left\{x : \min_{j\in N^-(\tau)} d(x,p_j) - \min_{i\in N^+(\tau)} d(x,p_i) > a_l\right\}$$
$$= \left\{x : \exists i\in N^+(\tau),\ \forall j\in N^-(\tau),\ d(x,p_j) - d(x,p_i) > a_l\right\}$$
$$= \bigcup_{i\in N^+(\tau)}\ \bigcap_{j\in N^-(\tau)}\ \{x : d(x,p_j) - d(x,p_i) > a_l\}$$
$$= \bigcup_{i\in N^+(\tau)}\ \bigcap_{j\in N^-(\tau)}\ W^{(\mathrm{ind}(p_i),\mathrm{ind}(p_j))}_{a_l}.$$
Define the $N^-(\tau)$-fold intersection set
$$A^{(R)}_i := \bigcap_{j\in N^-(\tau)} W^{(\mathrm{ind}(p_i),\mathrm{ind}(p_j))}_{a_l}$$
and the class of such sets by
$$\mathcal{A}_R := \left\{A^{(R)}_i : i\in N^+(\tau)\right\}.$$
From Theorem 13.5(iii) of [17] it follows that
$$\#(\mathcal{A}_R;S^+) \le \#(W_{a_l};S^+)^{N^-(\tau)}.$$
Now, let
$$B_R := \bigcup_{i\in N^+(\tau)} A^{(R)}_i$$
and denote the class of all such $N^+(\tau)$-fold unions by $\mathcal{B} := \{B_R : R\subseteq\mathcal{X}\}$. From Theorem 13.5(iv) of [17] it follows that
$$\#(\mathcal{B};S^+) \le \#(\mathcal{A}_R;S^+)^{N^+(\tau)}.$$
Hence we have
$$\#(\mathcal{B};S^+) \le \#(W_{a_l};S^+)^{N^-(\tau)N^+(\tau)}.$$
Finally, we have $V^+_{R,\tau,a_l} = B_R$ and $\mathcal{V}^+_{\tau,a_l} = \mathcal{B}$. Combining all of the above we obtain
$$\#(\mathcal{M}^+_{\tau,a_l};S) = \#(\mathcal{V}^+_{\tau,a_l};S^+) = \#(\mathcal{B};S^+) \le \#(W_{a_l};S^+)^{N^-(\tau)N^+(\tau)}.$$

Denote the set of dichotomies of $S$ by all sets $M^+_{R,\tau,a_l}$ as
$$U_{R,\tau,a_l}(S) := \left\{S\cap M^+_{R,\tau,a_l} : M^+_{R,\tau,a_l}\in\mathcal{M}^+_{\tau,a_l}\right\}.$$
Letting $\tau$ be unfixed, the number of dichotomies obtained by $\mathcal{M}^+_{a_l}$ satisfies
$$\left|\left\{U_{R,\tau,a_l}(S) : \tau\in\mathcal{Y}^n,\ R\subseteq\mathcal{X},\ |R|=n\right\}\right| = \left|\bigcup_{\tau\in\mathcal{Y}^n}\left\{U_{R,\tau,a_l}(S) : R\subseteq\mathcal{X},\ |R|=n\right\}\right|$$
$$\le \sum_{\tau\in\mathcal{Y}^n}\left|\left\{U_{R,\tau,a_l}(S) : R\subseteq\mathcal{X},\ |R|=n\right\}\right| \quad (4.10)$$
$$\le \sum_{\tau\in\mathcal{Y}^n}\#(W_{a_l};S^+)^{N^+(\tau)N^-(\tau)} \quad (4.11)$$
$$= \sum_{k=1}^{n-1}\binom{n}{k}\,\#(W_{a_l};S^+)^{k(n-k)}, \quad (4.12)$$
where (4.11) follows from Proposition 2.
Remark 3. The expression $2^n\left(\#(W_{a_l};S^+)\right)^{n(n-1)}$ is a simple yet trivial bound on (4.11).

We now obtain a much tighter bound than the one in Remark 3.

Proposition 4. Define $w$ by $w := \#(W_{a_l};S^+)$. Then the following bound holds:
$$\sum_{k=1}^{n-1}\binom{n}{k}w^{k(n-k)} \le w^{n^2/4}\,n\binom{n}{\lfloor n/2\rfloor}. \quad (4.13)$$
Proof: We have that $k(n-k)$ has a maximum at $k = n/2$ and therefore $k(n-k)\le n^2/4$. Furthermore, for all $k$,
$$\binom{n}{k}\le\binom{n}{\lfloor n/2\rfloor}.$$
Hence
$$\sum_{k=1}^{n-1}\binom{n}{k}w^{k(n-k)} \le n\binom{n}{\lfloor n/2\rfloor}w^{n^2/4}.$$

From (4.12) and Proposition 4, it follows that for any $S\subseteq\mathcal{X}\times\{1\}$ of cardinality $m$,
$$\#(\mathcal{M}^+_{a_l};S) \le w^{n^2/4}\,n\binom{n}{\lfloor n/2\rfloor}. \quad (4.14)$$
We now consider the class $\mathcal{M}^-_{a_l}$.

Proposition 5. For any fixed $\tau\in\mathcal{Y}^n$ and for any subset $S\subseteq\mathcal{X}\times\{-1\}$, the number of dichotomies $\#(\mathcal{M}^-_{\tau,a_l};S)$ obtained by $\mathcal{M}^-_{\tau,a_l}$ on $S$ is bounded from above by the right side of (4.14).

Proof. We follow the proof of Proposition 2, and instead of the class $\mathcal{M}^+_{\tau,a_l}$ and a set $M^+_{R,\tau,a_l}$ we consider the class $\mathcal{M}^-_{\tau,a_l}$ and a set $M^-_{R,\tau,a_l}$. By definition this set equals
$$M^-_{R,\tau,a_l} = \left\{(x,-1) : \min_{j\in N^-(\tau)} d(x,p_j) - \min_{i\in N^+(\tau)} d(x,p_i) < -a_l\right\}$$
and can be written as
$$M^-_{R,\tau,a_l} = \left\{(x,-1) : \min_{i\in N^+(\tau)} d(x,p_i) - \min_{j\in N^-(\tau)} d(x,p_j) > a_l\right\}. \quad (4.15)$$
Thus, as in the proof of Proposition 2, the set $M^-_{R,\tau,a_l}$ corresponds to a set $V^-_{R,\tau,a_l}$ which is defined as
$$V^-_{R,\tau,a_l} = \left\{x : \min_{i\in N^+(\tau)} d(x,p_i) - \min_{j\in N^-(\tau)} d(x,p_j) > a_l\right\}$$
$$= \left\{x : \exists j\in N^-(\tau),\ \forall i\in N^+(\tau),\ d(x,p_i) - d(x,p_j) > a_l\right\}$$
$$= \bigcup_{j\in N^-(\tau)}\ \bigcap_{i\in N^+(\tau)}\ \{x : d(x,p_i) - d(x,p_j) > a_l\}$$
$$= \bigcup_{j\in N^-(\tau)}\ \bigcap_{i\in N^+(\tau)}\ W^{(\mathrm{ind}(p_j),\mathrm{ind}(p_i))}_{a_l}.$$
From here the proof proceeds as the proof of Proposition 2, just swapping the indices $i$ with $j$, the sets $N^-$ with $N^+$, and the sets $S^+$ with $S^-$.

4.5 Finalizing

In the previous section we derived an upper bound on the number of dichotomies on any set of cardinality $m$ obtained by $\mathcal{M}^+_{\tau,a_l}$ or $\mathcal{M}^-_{\tau,a_l}$, in terms of the number of dichotomies by the classes $W_{a_l}$. For $T\subseteq\mathcal{X}$, let $w(T) := \#(W_{a_l};T)$ and define the growth function of $W_{a_l}$ as
$$w_m := \max_{T:\ |T|=m} w(T).$$
From (4.14) and from the fact that $w_m \ge 1$, it follows that the growth function of $\mathcal{M}^+_{\tau,a_l}$ is bounded as follows:
$$\Pi_{\mathcal{M}^+_{a_l}}(m) \le w_m^{n^2/4}\,n\binom{n}{\lfloor n/2\rfloor}.$$
Using a standard bound for the central binomial coefficient, we have
$$\binom{n}{\lfloor n/2\rfloor} < 2^n\sqrt{\frac{2}{\pi n}}$$
and so
$$\Pi_{\mathcal{M}^+_{a_l}}(m) \le w_m^{n^2/4}\,\sqrt{\frac{2n}{\pi}}\,2^n. \quad (4.16)$$
From (4.9) we have
$$w_m \le \left(\frac{em}{w(a_l)}\right)^{w(a_l)}$$
and therefore
$$\ln\Pi_{\mathcal{M}^+_{a_l}}(m) \le \frac{n^2}{4}\ln w_m + n\ln 2 + \frac12\ln\frac{2n}{\pi} \le \frac{n^2\,w(a_l)}{4}\ln\frac{em}{w(a_l)} + n\ln 2 + \frac12\ln\frac{2n}{\pi}. \quad (4.17)$$
From (4.7), we have $w(a_l) = w_l$. Therefore (4.17) equals
$$\frac{n^2 w_l}{4}\ln\frac{em}{w_l} + n\ln 2 + \frac12\ln\frac{2n}{\pi}.$$
We now define $G(m,\gamma)$ of Section 3.2 as follows (recall that $G(m,\gamma)$ bounds the growth function evaluated at $2m$): for any $\gamma\in\Gamma_{l+1}$,
$$G(m,\gamma) := \frac{n^2 w_l}{4}\ln\frac{2em}{w_l} + n\ln 2 + \frac12\ln\frac{2n}{\pi}.$$
Note that for all $\gamma\in\Gamma_l$, $G(m,\gamma) = G(m,a_{l-1})$, and the second term inside the square root in (3.9) satisfies $\ln(8(C+1)/(\delta\gamma)) \ge \ln(8(C+1)/(\delta a_{l-1}))$, so it follows that $\epsilon(\gamma)\ge\epsilon(a_{l-1})$, and therefore the requirement on $\epsilon(\gamma)$ of Section 3.2, namely that it is non-decreasing as $\gamma$ decreases, is satisfied.

5 Main result

To obtain the main result, we draw on some results and notation from [10]. We define a few quantities, leaving the dependence on $N$ implicit. Let us denote by the $i$-th shell of the binary $N$-dimensional cube $\{-1,1\}^N$ (the $N$-cube) the set of all vertices that have $i$ components that are $1$. Denote by
$$c_j := \binom{N}{j}$$
the number of vertices in the $j$-th shell. Denote by
$$b_n := \sum_{j=1}^n c_j \quad (5.1)$$
the number of vertices of the cube contained in the first $n$ shells, and let $b_0 := 0$. Let us define
$$\ell_n := \sum_{j=1}^n j\,c_j, \quad (5.2)$$
the total weight (number of 1-entries) in all of the vertices of the cube that are in the first $n$ shells. For a positive integer $m$ define
$$Q(m) := \min\{q : \ell_q \ge m\}. \quad (5.3)$$
For instance, if $m = 17$ and $N = 4$, then $\ell_2 = 16 < 17 \le \ell_3 = 28$, so $Q(17) = 3$.
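The shell quantities $\ell_q$ and $Q(m)$ of (5.2)-(5.3) are straightforward to compute; a minimal sketch (our function names, assuming $m \le \ell_N$ so the loop terminates) reproduces the worked example from the text.

```python
from math import comb

def ell(q, N):
    """ell_q (5.2): total number of 1-entries over the first q shells of the N-cube."""
    return sum(j * comb(N, j) for j in range(1, q + 1))

def Q(m, N):
    """Q(m) (5.3): the smallest q with ell_q >= m (assumes m <= ell_N)."""
    q = 1
    while ell(q, N) < m:
        q += 1
    return q

# Worked example from the text: N = 4, m = 17.
print(ell(2, 4), ell(3, 4))  # -> 16 28
print(Q(17, 4))              # -> 3
```

The first shell of the 4-cube contributes $1\cdot 4$ ones, the second $2\cdot 6$, and the third $3\cdot 4$, which is where 16 and 28 come from.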
Let $\rho := m - \ell_{Q(m)-1}$, and then define $\lceil m\rceil$ as the following rounded value:
$$\lceil m\rceil := \begin{cases} m & \text{if } Q(m) = 1 \text{ or } \rho\bmod Q(m) = 0 \\ m + \left(Q(m) - \rho\bmod Q(m)\right) & \text{otherwise.}\end{cases}$$
So, $\lceil m\rceil$ is the smallest integer greater than or equal to $m$ which is the total weight (number of 1-entries) in a set of vertices of the cube, where that set is formed by first populating the first shell, then the second, and so on. So, $m$ 1-entries might not be enough to form all the vertices of shells $1$ to $Q(m)-1$ and then some vertices in shell $Q(m)$: an additional (at most $Q(m)-1$) 1-entries might be necessary (to complete a vertex in shell $Q(m)$). For instance (continuing the above example), since $Q(17) = 3$ and $\rho = 1$, we have $\lceil 17\rceil = 17 + (3 - 1\bmod 3) = 19$, and indeed 19 1-entries are required to form the vertices in shells 1 and 2 and the first vertex of the 3rd shell.

For a positive integer $m$ let us define
$$\Phi(m) := b_{Q(m)-1} + \frac{\lceil m\rceil - \ell_{Q(m)-1}}{Q(m)} \quad (5.4)$$
and define $\Phi(0) := 0$. Define the numbers $\eta_i$ as follows (where $m_i$ is the multiplicity of $a_i$, as earlier):
$$\eta_0 = 0,\qquad \eta_i = \lceil\eta_{i-1} + m_i\rceil,\quad 1\le i\le L. \quad (5.5)$$
Note that the $\eta_i$, $1\le i\le L$, depend only on the $m_i$ and hence can be evaluated directly from the matrix $F$.

The following is the main result of the paper.

Theorem 6. Let $N\ge 1$ and $n\ge 2$ be fixed integers and let $\mathcal{X} = \{x_i\}_{i=1}^N$ be a finite distance space with a distance function $d(x_i,x_j)$, normalized such that $\mathrm{diam}(\mathcal{X}) = \max_{1\le i,j\le N} d(x_i,x_j) = 1$. Let
$$F^{(i)}_j := \begin{bmatrix} d(x_1,x_j) - d(x_1,x_i) \\ \vdots \\ d(x_N,x_j) - d(x_N,x_i)\end{bmatrix},$$
define the $N\times(N-1)$ matrix
$$F^{(i)} = \left[F^{(i)}_1,\dots,F^{(i)}_{i-1},F^{(i)}_{i+1},\dots,F^{(i)}_N\right]$$
and let
$$F := \left[F^{(1)},\dots,F^{(N)}\right].$$
Let $1 = a_0 > a_1 > \dots > a_L > a_{L+1} = 0$, where $a_1,\dots,a_L$ are the distinct positive entries of $F$, and let $m_l\ge 1$ be the number of times that $a_l$ appears in $F$, $1\le l\le L$. Define $\Gamma_l := (a_l,a_{l-1}]$, $1\le l\le L$, $\Gamma_{L+1} := [0,a_L]$ and $C := \sum_{l=1}^L a_l$. For $1\le l\le L$, let $w_l = \log_2(\Phi(\eta_l)+1)$.
Let $\mathcal{Y} = \{-1,1\}$ and let $(R,\tau)\in(\mathcal{X}\times\mathcal{Y})^n$, $R := \{p_i\}_{i=1}^n$, denote any set of $n$ prototypes $p_i$ with $N^+(\tau)$ of them labeled $1$ and $N^-(\tau)$ labeled $-1$. Let $h_{R,\tau}$ be a nearest-prototype binary classifier, given by
$$h_{R,\tau}(x) = \begin{cases} 1 & \text{if } \operatorname{argmin}_{1\le i\le n} d(x,p_i)\in N^+(\tau) \\ -1 & \text{otherwise}\end{cases} \quad (5.6)$$
and define its signed width function $f_{R,\tau}$ as
$$f_{R,\tau}(x) = \min_{j\in N^-(\tau)} d(x,p_j) - \min_{i\in N^+(\tau)} d(x,p_i).$$
Let $P^m := P^m_{XY}$ be a probability measure over $(\mathcal{X}\times\mathcal{Y})^m$. For any $0<\delta<1$, with $P^m$-probability at least $1-\delta$ the following holds for an i.i.d. sample $\sigma := \{(X_i,Y_i)\}_{i=1}^m\in(\mathcal{X}\times\mathcal{Y})^m$ drawn according to $P^m$: for any set of $n$ labeled prototypes $(R,\tau)\in(\mathcal{X}\times\mathcal{Y})^n$, where $(R,\tau)$ may depend on the sample (and, in particular, may be a subset of $\sigma$), and for all $\gamma>0$,
$$P(Yf_{R,\tau}(X)\le 0) \le \frac{1}{m}\sum_{j=1}^m I\{Y_j f_{R,\tau}(X_j)\le\gamma\} + \epsilon(m,\gamma,\delta), \quad (5.7)$$
where, for $\gamma\in\Gamma_{l+1}$, $0\le l\le L$,
$$\epsilon(m,\gamma,\delta) := \sqrt{\frac{32}{m}\left(\frac{n^2 w_l}{4}\ln\frac{2em}{w_l} + n\ln 2 + \frac12\ln\frac{2n}{\pi} + \ln\frac{8(C+1)}{\delta\gamma}\right)} \quad (5.8)$$
with $w_l := \log_2(\Phi(\eta_l)+1)$.

The size $N$ of the distance space does not enter the bound of Theorem 6 explicitly, but because the values of $w_l$ and $C$ depend on the distance space through the positive entries of the matrix $F$, these quantities may grow with $N$ (depending on the distance space $\mathcal{X}$). The value of $w_l$ decreases as $l = l(\gamma)$ decreases (because the intervals $\Gamma_l$ are situated more to the right as $l$ decreases). Thus $w_l$ decreases as a step function over the intervals $\Gamma_l$ as $\gamma$ increases: the larger the value of $\gamma$, the lower the interval index $l$ and the lower the value of $w_l$. Thus the upper bound (5.8) decreases as $\gamma$ increases (assuming that the change in $w_l$ dominates the change in the $\ln\frac{1}{\gamma}$ term). To get a feel for the rate of decrease of $w_l$ with respect to $\gamma$, see the example on p. 23 of [10]. Since $\gamma$ is a parameter that may be chosen after the random sample is drawn, it can depend on the sample, which makes the bound data-dependent.
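The combinatorial quantities $\lceil m\rceil$ and $\Phi(m)$ of (5.4) that enter $w_l$ in Theorem 6 can be sketched directly from their definitions; all function names below are ours, and the sketch assumes $m\le\ell_N$, reproducing the running example with $N = 4$, $m = 17$.

```python
from math import comb

def ell(q, N):
    """ell_q (5.2): total weight of the first q shells of the N-cube."""
    return sum(j * comb(N, j) for j in range(1, q + 1))

def b(q, N):
    """b_q (5.1): number of vertices in the first q shells."""
    return sum(comb(N, j) for j in range(1, q + 1))

def Q(m, N):
    """Q(m) (5.3): smallest q with ell_q >= m."""
    q = 1
    while ell(q, N) < m:
        q += 1
    return q

def ceil_weight(m, N):
    """<m>: smallest weight >= m realizable by whole vertices, filling shells in order."""
    q = Q(m, N)
    rho = m - ell(q - 1, N)
    if q == 1 or rho % q == 0:
        return m
    return m + (q - rho % q)

def Phi(m, N):
    """Phi(m) (5.4): number of cube vertices carrying <m> 1-entries."""
    if m == 0:
        return 0
    q = Q(m, N)
    return b(q - 1, N) + (ceil_weight(m, N) - ell(q - 1, N)) // q

print(ceil_weight(17, 4))  # -> 19, as in the worked example
print(Phi(17, 4))          # -> 11: 10 vertices in shells 1-2 plus one in shell 3
```

Feeding the cumulative multiplicities $\eta_l$ of (5.5) into `Phi` then yields $w_l = \log_2(\Phi(\eta_l)+1)$, which is everything (5.8) needs beyond $n$, $m$, $C$, $\delta$ and $\gamma$.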
5.1 Comparison with other results

Theorem 6 states an upper bound on the generalization error for learning on any finite distance space which decreases with respect to m like O(√((n²/m) ln m)). Let us compare this rate to other works. If the distance space is X = R^d and if the prototype set R is a subset of the sample, then Theorem 19.6 of [17] gives the following error bound for learning nearest-neighbor classifiers that are based on R:

ε = O( sqrt( (1/m) [ (d + 1) n(n − 1)/2 · ln m + n + ln(1/δ) ] ) ),

which is also O(√((n²/m) ln m)) but, in contrast to Theorem 6, is not data-dependent, which in general makes it looser.

In [15], Theorem 1 presents an error bound for learning LVQ on R^d which has a dependence on the sample margin of the form O(√(a n² / (γ² m))), where a = min{d + 1, B²/γ²}, B bounds the magnitude of each sample point, and 0 < γ < 1 is the sample margin (which is defined similarly to the width (2.4)). They claim that this bound is independent of the dimension d, presumably because if B²/γ² is smaller than d + 1 then the d + 1 factor disappears from the bound. Their proof is not included; however, it appears to be a direct application of fat-shattering error bounds; see, for instance, Theorem 4.18 of [16], which bounds the generalization error of learning linear classifiers and has the same 1/γ² dependence on the margin. These bounds are based on a bound on the log of the γ-covering number in terms of the fat-shattering dimension d* of linear classifiers, with d* ≤ B²/γ².

In comparison, the bound of Theorem 6 also depends on γ, but indirectly, through w_l (as discussed above, l = l(γ)). Depending on the distance space's set S_F of positive entries, w_l may decrease, as γ grows, even faster than 1/γ².

Although we assume that the distance space is finite of cardinality N, the bound (5.8) may or may not grow with N; this depends on the matrix F. In comparison to the above works, in (5.8) there is an implicit complexity quantity which enters through w_l and is defined as ψ(σ_l).
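The step-function dependence of the bound on γ can be made concrete. The following sketch builds the matrix F from a normalized distance matrix, extracts the distinct positive entries a_1 > ⋯ > a_L, and locates the interval index l with γ ∈ Δ_{l+1}. This is a minimal sketch under illustrative conventions (function names are ours, and the quantities ψ(σ_l) and w_l themselves, which require the shell counts defined earlier, are omitted):

```python
import numpy as np

def positive_levels(D):
    """Distinct positive entries of F, sorted decreasingly: a_1 > ... > a_L.
    D is the N x N distance matrix, assumed normalized to diameter 1."""
    N = D.shape[0]
    cols = []
    for i in range(N):
        for j in range(N):
            if j != i:
                cols.append(D[:, j] - D[:, i])  # column f_j^{(i)} of F
    F = np.stack(cols, axis=1)                  # the N x N(N-1) matrix F
    vals = np.unique(F[F > 0])                  # ascending distinct positives
    return vals[::-1]                           # decreasing order

def interval_index(gamma, a):
    """Return l such that gamma lies in Delta_{l+1} = (a_{l+1}, a_l],
    using the conventions a_0 = 1 and a_{L+1} = 0."""
    levels = [1.0] + [v for v in a if v < 1.0] + [0.0]
    for l in range(len(levels) - 1):
        if levels[l + 1] < gamma <= levels[l]:
            return l
    raise ValueError("gamma must lie in (0, 1]")
```

For instance, for three collinear points at 0, 0.5 and 1 the positive entries of F are {0.5, 1.0}, so γ = 0.7 falls in Δ_1 (index l = 0) while γ = 0.3 falls in Δ_2 (index l = 1); as γ grows, l drops, and with it w_l and the bound (5.8).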
The quantity ψ(σ_l) is an upper bound on the pseudo-rank of the perturbed matrix F − a_l (see [10]), which is the number of distinct columns of the matrix sgn(F − a_l), and it may depend on N (depending on the definition of the distance space).

6 Conclusions

We use the concept of width to learn a family of classifiers based on the nearest-prototype decision rule over an arbitrary finite distance space (a significantly more general and perhaps
more applicable setting than that of metric spaces). In this setting a classifier is represented by a set of n prototypes, which can be any labeled points in the distance space, even a subset of the sample. We obtain an upper bound on the generalization error of any such nearest-prototype classifier. Using γ as the width parameter, the error bound depends on the γ-empirical error and holds uniformly over the family of such classifiers. Ignoring the dependence on γ, the bound is O(√((n²/m) ln m)). The dependence on γ is more subtle because γ enters through a complexity quantity ψ(σ_l), which bounds from above the pseudo-rank of a perturbed version of a matrix that represents all half-spaces in the distance space.

Acknowledgements

This work was supported in part by a research grant from the Suntory and Toyota International Centres for Economics and Related Disciplines at the London School of Economics. The authors thank the referees for their comments, which resulted in shorter proofs of Propositions 2 and 4.

References

[1] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

[2] M. Anthony and P. L. Bartlett. Function learning from interpolation. Combinatorics, Probability and Computing, 9, 2000.

[3] M. Anthony and J. Ratsaby. Classification based on prototypes with spheres of influence. Information and Computation, 256, 2017.

[4] M. Anthony and J. Ratsaby. Maximal width learning of binary functions. Theoretical Computer Science, 411, 2010.

[5] M. Anthony and J. Ratsaby. A hybrid classifier based on boxes and nearest neighbors. Discrete Applied Mathematics, 172:1-11, 2014.

[6] M. Anthony and J. Ratsaby. Analysis of a multi-category classifier. Discrete Applied Mathematics, 160(16-17), 2012.

[7] M. Anthony and J. Ratsaby. Learning bounds via sample width for classifiers on finite metric spaces. Theoretical Computer Science, 529:2-10, 2014.

[8] M. Anthony and J. Ratsaby. Multi-category classifiers and sample width.
Journal of Computer and System Sciences, 82(8), 2016.

[9] M. Anthony and J. Ratsaby. A probabilistic approach to case-based inference. Theoretical Computer Science, 589:61-75, 2015.

[10] M. Anthony and J. Ratsaby. Large-width bounds for learning half-spaces on distance spaces. Submitted, 2017.
[11] A. Belousov and J. Ratsaby. Massively parallel computations of the LZ-complexity of strings. In Proc. of the 28th IEEE Convention of Electrical and Electronics Engineers in Israel (IEEEI'14), pages 1-5, Eilat, December 2014.

[12] A. Belousov and J. Ratsaby. A parallel distributed processing algorithm for image feature extraction. In Advances in Intelligent Data Analysis XIV: 14th International Symposium, IDA 2015, Saint-Etienne, France, October 22-24, 2015, Proceedings, volume 9385 of Lecture Notes in Computer Science. Springer, 2015.

[13] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4), 1989.

[14] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21-27, January 1967.

[15] K. Crammer, R. Gilad-Bachrach, A. Navot, and N. Tishby. Margin analysis of the LVQ algorithm. In Advances in Neural Information Processing Systems 15 (NIPS 2002, December 9-14, 2002, Vancouver, British Columbia, Canada), 2002.

[16] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.

[17] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.

[18] M. Deza and E. Deza. Encyclopedia of Distances, volume 15 of Series in Computer Science. Springer-Verlag, 2009.

[19] R. M. Dudley. Uniform Central Limit Theorems, volume 63 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, UK, 1999.

[20] B. Hammer, M. Strickert, and T. Villmann. On the generalization ability of GRLVQ networks. Neural Processing Letters, 21(2), 2005.

[21] T. Kohonen. Self-Organizing Maps. Springer, 2001.

[22] E. Pekalska and R. P. W. Duin. The Dissimilarity Representation for Pattern Recognition: Foundations and Applications. World Scientific Publishing, Singapore, 2005.

[23] D. Pollard (1984).
Convergence of Stochastic Processes. Springer-Verlag.

[24] P. Schneider, M. Biehl, and B. Hammer. Adaptive relevance matrices in learning vector quantization. Neural Computation, 21(12), 2009.

[25] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1998.

[26] A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans. Advances in Large-Margin Classifiers (Neural Information Processing). MIT Press, 2000.

[27] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[28] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2), 1971.
Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question
More informationUnsupervised Learning with Permuted Data
Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University
More informationBennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence
Bennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence Chao Zhang The Biodesign Institute Arizona State University Tempe, AZ 8587, USA Abstract In this paper, we present
More informationGI01/M055: Supervised Learning
GI01/M055: Supervised Learning 1. Introduction to Supervised Learning October 5, 2009 John Shawe-Taylor 1 Course information 1. When: Mondays, 14:00 17:00 Where: Room 1.20, Engineering Building, Malet
More information