Large width nearest prototype classification on general distance spaces
Martin Anthony (1) and Joel Ratsaby (2)

(1) Department of Mathematics, London School of Economics and Political Science, Houghton Street, London WC2A 2AE, U.K.
(2) Department of Electrical and Electronics Engineering, Ariel University of Samaria, Ariel 40700, Israel

Abstract

In this paper we consider the problem of learning nearest-prototype classifiers in any finite distance space, that is, in any finite set equipped with a distance function. An important advantage of a distance space over a metric space is that the triangle inequality need not be satisfied, which makes our results potentially very useful in practice. We consider a family of binary classifiers for learning nearest-prototype classification on distance spaces, building on the concept of large-width learning which we introduced and studied in earlier works. Nearest-prototype classification is a more general version of the ubiquitous nearest-neighbor classifier: a prototype may or may not be a sample point. One advantage of the approach taken in this paper is that the error bounds depend on a width parameter, which can be sample-dependent and thereby yield a tighter bound.

1 Introduction

Learning Vector Quantization (LVQ) and its various extensions, introduced by Kohonen [21], are used successfully in many machine learning tools and applications. Learning pattern classification by LVQ is based on adapting a fixed set of labeled prototypes in Euclidean space and using the resulting set of prototypes in a nearest-prototype (winner-take-all) rule to classify any point in the input space. As [20] mentions, LVQ fails if a Euclidean representation is not well-suited for the data; there have been extensions of LVQ that try to allow different metrics [20, 24] and to take advantage of samples for which a more confident (or large-margin) classification can be obtained.
Generalization error bounds with dependence on this sample margin are stated in [20, 24] for learning over Euclidean spaces and, as is usually the case for large-margin learning [1], the error bounds are tighter than ones with no sample-margin dependence. The results of such work are important as they explain why LVQ works well in practice in Euclidean metric spaces.
There are learning domains in which it is difficult to formalize quantitative features that are encoded as numerical variables which together constitute a vector space (usually Euclidean) to discriminate between objects that belong to different classes [22]. Learning over such domains requires a qualitative approach which tries to describe the structure of objects in a way that is similar to how humans do, for instance, in terms of morphological elements of objects. Objects are then represented not by numerical vectors but by other means, such as strings of symbols, which can be compared using a dissimilarity (or distance) function. This approach is much more flexible than one based on numerical features since there are many existing distance functions [18] and new ones can be defined easily for any kind of objects, for instance bioinformatic sequences, graphs, images, etc., and they do not have to satisfy the requirements of a metric. However, most learning algorithms, in particular neural networks, which have been very successful recently, require a Euclidean space or, more generally, a vector space represented by numerical features. Such problem domains are potential applications of prototype-based learning over non-Euclidean or, more generally, non-metric spaces. In [3] we studied learning binary classification with nearest-prototype classifiers over metric spaces and obtained sample-dependent error bounds. In the current paper we consider learning binary classification on finite distance spaces, that is, finite sets equipped with a distance function (often called a dissimilarity measure [18]), where the classifiers (which we call nearest-prototype classifiers) are generalizations of the well-known nearest-neighbor classifier [14]. An important advantage of a distance space over a metric space is that the triangle inequality need not be satisfied, which makes our results potentially very useful in practice.
Our definition of a distance function is quite loose in that it does not need to satisfy any of the non-negativity, symmetry or reflexivity properties of a proper distance function [18]. We still call it a distance because, as far as we can expect in applying our learning results, any useful space has at least the non-negativity property, and so we will assume in the paper that the distance function satisfies the non-negativity property. We consider a family of binary classifiers for learning nearest-prototype classification on distance spaces, building on the concept of large-width learning which was introduced in [4] and expanded to various classification settings [5-10]. The advantage of this approach is that the error bounds depend on a width parameter which can be sample-dependent and thereby yield a tighter error bound. We define a width function which measures the difference between the distance from a test point x (to be classified) to its nearest negative prototype and the distance to its nearest positive prototype. The classifier's decision is defined as the sign of this difference. The set of prototypes from which these two nearest ones are obtained is very general in that it can be any set of points in the distance space. In particular, it can be a subset of the sample and can be determined via any algorithm. The error bounds that we state in the current paper apply regardless of the algorithm that is used to determine these prototypes. The fact that we deal with a distance space, rather than a metric space, means that the triangle inequality need not be satisfied. The error bound depends on quantities that can be evaluated directly from a matrix which consists of the half-space functions over the distance space. This matrix is a function of the distance matrix of the space and hence it can be computed efficiently using massively parallel processing techniques, for instance, as in [11, 12].
Also, as mentioned above, our use of the concept of distance is loose, so that, for instance, not only need the triangle inequality not be satisfied, but none of the three standard properties of a distance need be satisfied either.

2 Setup

2.1 Nearest-prototype classifiers

For a positive integer $n$, let $[n] := \{1,2,\dots,n\}$. We consider a finite set $\mathcal{X} := \{x_1,\dots,x_N\}$ with a binary set $\mathcal{Y} = \{-1,1\}$ of possible classifications. Let $d$, a distance function, be a function from $\mathcal{X}\times\mathcal{X}$ to $\mathbb{R}$. Let us assume that $d$ is normalized such that
$$\mathrm{diam}(\mathcal{X}) := \max_{1\le i,j\le N} d(x_i,x_j) = 1.$$
A prototype $p_i \in \mathcal{X}$ is a point in the distance space that has an associated label denoted by $\tau_i \in \mathcal{Y}$. We write $p_i^+$, $p_j^-$ for a prototype with label $\tau_i = 1$ and a prototype with label $\tau_j = -1$, respectively. When the label of a prototype is not explicitly mentioned we write $p_i$. Given a fixed integer $n \ge 2$, we consider learning the family of classifiers that are defined by the nearest-neighbor rule based on a set of $n$ prototypes. We refer to any such classifier by $h_{R,\tau}$, and it is defined by an ordered set of prototypes $R = \{p_i\}_{i=1}^n$ and their corresponding label vector $\tau := [\tau_1,\dots,\tau_n]$, $\tau_i \in \mathcal{Y}$, $1\le i\le n$. The order of the set $R$ can be defined based on any ordering of $\mathcal{X}$ and it allows us to fix $\tau$ in advance and consider all classifiers obtained by all ordered sets $R$. We define $h_{R,\tau}$ to be a nearest-prototype classifier as follows: let $N^+ = N^+(\tau) := \{i\in[n] : \tau_i = 1\}$ and $N^- = N^-(\tau) := \{i\in[n] : \tau_i = -1\}$. Then, given $x$,
$$h_{R,\tau}(x) = \begin{cases} -1 & \text{if } \operatorname{argmin}_{1\le i\le n} d(x,p_i) \in N^- \\ 1 & \text{otherwise.}\end{cases} \quad (2.1)$$
Let us denote by $\sigma$ a sample of $m$ labeled examples,
$$\sigma := \{(X_i,Y_i)\}_{i=1}^m. \quad (2.2)$$
In the current paper, a prototype $p_i$ may be any point in $\mathcal{X}$; in particular, one that depends on the sample directly or via some learning algorithm. For instance, if $(R,\tau) = \sigma$, then $h_{R,\tau}$ is the well-known nearest-neighbor classifier [14]. If the labeled prototype set is only a subset of the sample, $(R,\tau)\subset\sigma$, then $h_{R,\tau}$ belongs to the family of so-called edited nearest-neighbor classifiers [17].
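As a minimal illustration of the nearest-prototype rule (2.1), the following sketch implements it directly from a distance function; all names here (`nearest_prototype_classify`, the toy distance) are ours, not the paper's, and the distance need not satisfy the triangle inequality.

```python
def nearest_prototype_classify(x, prototypes, labels, d):
    """Nearest-prototype rule (2.1): return -1 if the nearest prototype
    carries label -1, else +1.  labels is the vector tau of +1/-1 values;
    d is any (non-negative) dissimilarity -- no triangle inequality needed."""
    i_star = min(range(len(prototypes)), key=lambda i: d(x, prototypes[i]))
    return -1 if labels[i_star] == -1 else 1

# Toy distance space: points on a line with d = absolute difference.
d = lambda a, b: abs(a - b)
prototypes = [0.0, 1.0]
labels = [-1, 1]
print(nearest_prototype_classify(0.2, prototypes, labels, d))  # -> -1
print(nearest_prototype_classify(0.9, prototypes, labels, d))  # -> 1
```

With $n = 2$ oppositely labeled prototypes this is exactly the two-prototype classifier discussed later in Section 2.3.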
The prototype set $(R,\tau)$ may not necessarily be a subset of the sample, but could be derived from it via some adaptive procedure such as the LVQ algorithm [21], in which case $h_{R,\tau}$ would be the LVQ classifier. In the current paper, any such classifier is referred to as a nearest-prototype classifier and, as mentioned above, is denoted by $h_{R,\tau}$.
2.2 Probabilistic modeling of learning

We work in the framework of the popular PAC model of computational learning theory (see [13, 27]). This model assumes that the labeled examples $(X_i,Y_i)$ in the training sample have been generated randomly according to some fixed (but unknown) probability distribution $P$ on $Z = \mathcal{X}\times\mathcal{Y}$. (This includes, as a special case, the situation in which each $X_i$ is drawn according to a fixed distribution on $\mathcal{X}$ and is then labeled deterministically by $Y_i = t(X_i)$, where $t$ is some fixed function.) Thus, a sample (2.2) of length $m$ can be regarded as being drawn randomly according to the product probability distribution $P^m$. In general, suppose that $H$ is a set of functions from $\mathcal{X}$ to $\{-1,1\}$. An appropriate measure of how well $h\in H$ would perform on further randomly drawn points is its error, $\mathrm{er}_P(h)$, the probability that $h(X)\ne Y$ for random $(X,Y)$, which can be expressed as
$$\mathrm{er}_P(h) = P(h(X)\ne Y) = P(Yh(X) < 0). \quad (2.3)$$
Given any function $h\in H$, we can measure how well $h$ matches the training sample through its sample error
$$\mathrm{er}_\sigma(h) = \frac{1}{m}\left|\{i : h(X_i)\ne Y_i\}\right|$$
(the proportion of points in the sample incorrectly classified by $h$). Much classical work in learning theory (see [13, 27], for instance) related the error of a classifier $h$ to its sample error. A typical result would state that, for all $\delta\in(0,1)$, with probability at least $1-\delta$, for all $h\in H$ we have $\mathrm{er}_P(h) < \mathrm{er}_\sigma(h) + \epsilon(m,\delta)$, where $\epsilon(m,\delta)$ (known as a generalization error bound) is decreasing in $m$ and $\delta$. Such results can be derived using uniform convergence theorems from probability theory [19, 23, 28], in which case $\epsilon(m,\delta)$ would typically involve a quantity known as the growth function of the set of classifiers [1, 13, 27, 28]. More recently, emphasis has been placed on learning with a large margin (see, for instance, [1, 2, 25, 26]).
The rationale behind margin-based, or width-based, generalization error bounds is that if a classifier has managed to achieve a wide separation between the points of different classifications, then this indicates that it is a good classifier, and it is possible that a better generalization error bound can be obtained. Margin-based results apply when the classifiers are derived from real-valued functions by thresholding (taking their sign). A more direct approach, which does not require real-valued functions as a basis for a classification margin, uses the concept of width, introduced in [4] and studied in various settings in [5-10].

2.3 Width and error of a classifier

We define the width of $h_{R,\tau}$ at a point $x\in\mathcal{X}$ as follows:
$$w_{h_{R,\tau}}(x) := \min_{1\le i\le n:\ \tau_i\ne h(x)} d(x,p_i) \;-\; \min_{1\le i\le n:\ \tau_i = h(x)} d(x,p_i). \quad (2.4)$$
In words, the width of $h$ at $x$ is the difference between the distance to the nearest unlike prototype of $x$ and the distance to the nearest prototype to $x$, where "unlike" means of a different sign than the classification of $x$. In [10] we consider binary classifiers that are based on a pair of oppositely labeled prototypes and use this definition of width, which, in that case, becomes simply $d(x,p^-) - d(x,p^+)$.
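The width (2.4) can be sketched directly from its definition; this is an illustrative helper of ours (not the paper's code), reusing the toy one-dimensional distance from above.

```python
def width(x, prototypes, labels, d, h_x):
    """Width (2.4): distance to the nearest unlike-labeled prototype minus
    distance to the nearest like-labeled prototype, relative to h(x) = h_x."""
    unlike = min(d(x, p) for p, t in zip(prototypes, labels) if t != h_x)
    like = min(d(x, p) for p, t in zip(prototypes, labels) if t == h_x)
    return unlike - like

d = lambda a, b: abs(a - b)
# Two prototypes p^- = 0.0 and p^+ = 1.0; at x = 0.2 the rule gives h(x) = -1,
# and the width reduces to d(x, p^+) - d(x, p^-) = 0.8 - 0.2 = 0.6.
print(round(width(0.2, [0.0, 1.0], [-1, 1], d, -1), 6))  # -> 0.6
```

For a pair of oppositely labeled prototypes this reproduces the two-prototype formula $d(x,p^-) - d(x,p^+)$ up to the sign $h(x)$.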
The signed width (or margin) function corresponding to (2.4) is defined as
$$f_{R,\tau}(x) := f_{h_{R,\tau}}(x) = h_{R,\tau}(x)\, w_{h_{R,\tau}}(x). \quad (2.5)$$
Note that this definition means that for $x$ equidistant from two oppositely labeled prototypes $p,q\in R$ that are each the closest to $x$ among all other prototypes of the same label, the value of the margin $f_{R,\tau}(x)$ at this $x$ is zero. This definition is intuitive and actually makes the analysis simpler compared to an alternative definition of width [7]. This definition of width is an application of a more general definition of width, introduced in [5, 6], which takes the form $f(x) = d(x,S^-) - d(x,S^+)$, where $S^-$ and $S^+$ are any disjoint subsets of the input space that are labeled $-1$ and $1$, respectively. (In [7-9], a slightly different definition of width was used, where the union of the disjoint sets $S^-$ and $S^+$ equals the input space.) In [3] we considered binary classifiers which are also based on prototypes, where the decision is not based on the nearest prototype but on the combined influence of several prototypes based on certain regions of influence. The present notion of width was not explicitly utilized there.

For a positive margin parameter $\gamma > 0$ and a training sample $\sigma$, the empirical (sample) $\gamma$-margin error is defined as
$$\hat{P}_m(Yf_{R,\tau}(X) \le \gamma) = \frac{1}{m}\sum_{j=1}^m I\left(Y_j f_{R,\tau}(X_j) \le \gamma\right).$$
(Here, $I(A)$ is the indicator function of the set, or event, $A$.) Define the function
$$\mathrm{sgn}(a) := \begin{cases} 1 & \text{if } a > 0 \\ -1 & \text{if } a \le 0. \end{cases}$$
For the purpose of bounding the generalization error it is convenient to express the classification $h(x)$ in terms of the signed width as follows: $h_{R,\tau}(X) = \mathrm{sgn}(f_{R,\tau}(X))$. Therefore the generalization error $\mathrm{er}_P(h_{R,\tau})$ can be bounded as follows:
$$\mathrm{er}_P(h_{R,\tau}) = P(h_{R,\tau}(X)\ne Y) \quad (2.6)$$
$$= P(Yf_{R,\tau}(X) < 0) + P(Y = 1,\ f_{R,\tau}(X) = 0) \;\le\; P(Yf_{R,\tau}(X)\le 0). \quad (2.7)$$
Our aim is to show that the upper bound (2.7) on the generalization misclassification error is not much greater than $\hat{P}_m(Yf_{R,\tau}(X)\le\gamma)$.
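The empirical $\gamma$-margin error is a simple counting quantity; a minimal sketch (our names, our toy signed width on the line) is:

```python
def empirical_margin_error(sample, f, gamma):
    """hat{P}_m(Y f(X) <= gamma): fraction of examples whose signed-width
    margin y * f(x) is at most gamma."""
    return sum(1 for x, y in sample if y * f(x) <= gamma) / len(sample)

# Toy signed width with prototypes p^- = 0.0 and p^+ = 1.0, i.e. the
# two-prototype case f(x) = d(x, p^-) - d(x, p^+) of (2.5).
f = lambda x: abs(x - 0.0) - abs(x - 1.0)
sample = [(0.1, -1), (0.2, -1), (0.9, 1), (0.6, 1)]
# margins y*f(x): 0.8, 0.6, 0.8, 0.2 -- exactly one is <= 0.5
print(empirical_margin_error(sample, f, 0.5))  # -> 0.25
```

Choosing a larger $\gamma$ can only count more sample points, which is why the bounds below trade a larger empirical term against a smaller deviation term.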
Explicitly, we aim for bounds of the form: for all $\delta\in(0,1)$, with probability at least $1-\delta$, for all $\gamma\in(0,\mathrm{diam}(\mathcal{X})]$, we have
$$\mathrm{er}_P(h_{R,\tau}) \le \hat{P}_m(Yf_{R,\tau}(X)\le\gamma) + \epsilon(m,\gamma,\delta).$$
This will imply that if the learner finds a hypothesis which, for a large value of $\gamma$, has a small $\gamma$-margin error, then that hypothesis is likely to have small error.

The advantage of working with the notion of width is that it is possible to have such a uniform bound over a very large family of classifiers. For instance, in [7] we obtained such bounds for learning the family of all possible binary classifiers on any finite metric space, and in [8] we did the same for multi-category classifiers over infinite metric spaces. As mentioned above, in the current work we consider particular kinds of classifiers $h_{R,\tau}$ that are defined by the nearest-prototype rule based on a fixed number $n$ of prototypes. Thus we expect that the bound that we obtain is tighter than the one in [7], which holds for the family of all binary classifiers. To obtain a uniform bound, we are interested in showing that the probability of the "bad" event (namely, that there exist some value of $\gamma$ and some classifier $h_{R,\tau}$ such that the generalization error is not bounded from above by some small deviation from the empirical $\gamma$-margin error) is small. That is, we aim to bound the following failure probability:
$$P^m_{X,Y}\left(\left\{\sigma : \exists\gamma,\ \exists\tau,\ \exists R,\ P(Yf_{R,\tau}(X)\le 0) > \frac{1}{m}\sum_{j=1}^m I\{Y_j f_{R,\tau}(X_j)\le\gamma\} + \epsilon\right\}\right). \quad (2.8)$$
This can be expressed as follows:
$$P^m_{X,Y}\left(\left\{\sigma : \exists\gamma,\ \exists\tau,\ \exists R,\ P(Yf_{R,\tau}(X) > 0) < \frac{1}{m}\sum_{j=1}^m I\{Y_j f_{R,\tau}(X_j) > \gamma\} - \epsilon\right\}\right). \quad (2.9)$$
Let us fix $\gamma$ for now, and deal with bounding the probability
$$P^m_{X,Y}\left(\left\{\sigma : \exists\tau,\ \exists R,\ P(Yf_{R,\tau}(X) > 0) < \frac{1}{m}\sum_{j=1}^m I\{y_j f_{R,\tau}(x_j) > \gamma\} - \epsilon\right\}\right). \quad (2.10)$$

3 Towards bounding the probability

3.1 Representing the bad event by related sets

Define the set $M_{R,\tau,\gamma} \subseteq \mathcal{X}\times\mathcal{Y}$ as follows:
$$M_{R,\tau,\gamma} := \{(x,y) : yf_{R,\tau}(x) > \gamma\}$$
and let
$$M^+_{R,\tau,\gamma} := \{(x,1) : f_{R,\tau}(x) > \gamma\} \quad (3.1)$$
$$M^-_{R,\tau,\gamma} := \{(x,-1) : f_{R,\tau}(x) < -\gamma\}. \quad (3.2)$$
Note that
$$M_{R,\tau,\gamma} = \left(M_{R,\tau,\gamma}\cap\{(x,y): y=1\}\right) \cup \left(M_{R,\tau,\gamma}\cap\{(x,y): y=-1\}\right) = M^+_{R,\tau,\gamma}\cup M^-_{R,\tau,\gamma}. \quad (3.3)$$
We can write
$$P(Yf_{R,\tau}(X) > \gamma) = P(M_{R,\tau,\gamma})$$
and
$$\frac{1}{m}\sum_{j=1}^m I\{Y_j f_{R,\tau}(X_j) > \gamma\} = P_m(M_{R,\tau,\gamma}),$$
where $P_m$ denotes the empirical measure based on a sample of length $m$. Thus (2.9) is expressed as
$$P^m_{X,Y}\left(\{\sigma : \exists\gamma,\ \exists\tau,\ \exists R,\ P(M_{R,\tau,0}) < P_m(M_{R,\tau,\gamma}) - \epsilon\}\right).$$
Let $\epsilon(\gamma)$ be any function depending on $\gamma$, in a way to be specified later, and define the set $E(\gamma)\subseteq(\mathcal{X}\times\mathcal{Y})^m$ as
$$E(\gamma) := \{\sigma : \exists\tau,\ \exists R,\ P(M_{R,\tau,0}) < P_m(M_{R,\tau,\gamma}) - \epsilon(\gamma)\}. \quad (3.4)$$
Then substituting $\epsilon(\gamma)$ for $\epsilon$ in (2.10) implies that (2.10) equals the probability $P^m_{X,Y}(E(\gamma))$. It follows that (2.8) equals
$$P^m_{X,Y}\left(\bigcup_{\gamma\in(0,\mathrm{diam}(\mathcal{X})]} E(\gamma)\right). \quad (3.5)$$
Let $L$ be an integer (to be specified in a section further below). For integers $0\le l\le L+1$, let $\gamma_l$ be a decreasing sequence such that the following conditions hold: $0\le\gamma_l\le\gamma_0 = 1$ and $\gamma_{L+1} = 0$. Denote by
$$C := \sum_{l=1}^L \gamma_l.$$
While all the above quantities $L$, $\gamma_l$ and $C$ may depend on $\mathcal{X}$, we keep this dependence implicit in the notation. Define $\Gamma_l := (\gamma_l,\gamma_{l-1}]$ for $1\le l\le L+1$. Then (3.5) equals
$$P^m_{X,Y}\left(\bigcup_{l=1}^{L+1}\bigcup_{\gamma\in\Gamma_l} E(\gamma)\right) \le \sum_{l=1}^{L+1} P^m_{X,Y}\left(\bigcup_{\gamma\in\Gamma_l} E(\gamma)\right). \quad (3.6)$$
Define the set $E_l\subseteq(\mathcal{X}\times\mathcal{Y})^m$ as
$$E_l := \{\sigma : \exists\tau,\ \exists R,\ P(M_{R,\tau,\gamma_l}) < P_m(M_{R,\tau,\gamma_l}) - \epsilon(\gamma_{l-1})\}.$$
Henceforth, assume that $\epsilon(\gamma)$ is a non-increasing function over each interval $\Gamma_l$.
Proposition 1. For any $\gamma\in\Gamma_l$, $E(\gamma)\subseteq E_l$.

Proof. We have $M_{R,\tau,0}\supseteq M_{R,\tau,\gamma_l}$; thus $P(M_{R,\tau,0})\ge P(M_{R,\tau,\gamma_l})$. And $M_{R,\tau,\gamma_l}\supseteq M_{R,\tau,\gamma}$ since $\gamma_l\le\gamma$; therefore $P_m(M_{R,\tau,\gamma})\le P_m(M_{R,\tau,\gamma_l})$. For $\gamma\le\gamma_{l-1}$, by the above assumption on $\epsilon$, $\epsilon(\gamma)\ge\epsilon(\gamma_{l-1})$. It follows that $E(\gamma)\subseteq E_l$.

The event that there exist $\tau$ and $R$ such that $P(M_{R,\tau,\gamma_l}) < P_m(M_{R,\tau,\gamma_l}) - \epsilon(\gamma_{l-1})$ holds, together with (3.3), implies that at least one of the following events occurs: there exist $\tau$ and $R$ such that $P(M^+_{R,\tau,\gamma_l}) < P_m(M^+_{R,\tau,\gamma_l}) - \epsilon(\gamma_{l-1})/2$, or there exist $\tau$ and $R$ such that $P(M^-_{R,\tau,\gamma_l}) < P_m(M^-_{R,\tau,\gamma_l}) - \epsilon(\gamma_{l-1})/2$. Let
$$E^+_l := \left\{\sigma : \exists\tau,\ \exists R,\ P(M^+_{R,\tau,\gamma_l}) < P_m(M^+_{R,\tau,\gamma_l}) - \epsilon(\gamma_{l-1})/2\right\} \quad (3.7)$$
and
$$E^-_l := \left\{\sigma : \exists\tau,\ \exists R,\ P(M^-_{R,\tau,\gamma_l}) < P_m(M^-_{R,\tau,\gamma_l}) - \epsilon(\gamma_{l-1})/2\right\}. \quad (3.8)$$
Then
$$P^m_{X,Y}(E_l) \le P^m_{X,Y}(E^+_l) + P^m_{X,Y}(E^-_l).$$

3.2 Bounding the probability in terms of the growth function

We now aim to bound from above the first probability, $P^m_{X,Y}(E^+_l)$. We first briefly recall the definitions of growth function and VC-dimension [28]. Suppose that $\mathcal{C}$ is a collection of subsets of a set $Z$. Let $S$ be any (finite) subset of $Z$. Then a dichotomy of $S$ by $\mathcal{C}$ is a set of the form $S\cap C$ where $C\in\mathcal{C}$. We denote the number of dichotomies of $S$ by $\mathcal{C}$ as $\#(\mathcal{C};S)$. Thus,
$$\#(\mathcal{C};S) = \left|\{S\cap C : C\in\mathcal{C}\}\right|.$$
Then the growth function of $\mathcal{C}$ is the function $\Pi_\mathcal{C}:\mathbb{N}\to\mathbb{N}$ defined as follows: for $m\in\mathbb{N}$,
$$\Pi_\mathcal{C}(m) = \max\{\#(\mathcal{C};S) : S\subseteq Z,\ |S| = m\}.$$
The VC-dimension of $\mathcal{C}$ is (infinity, or) the largest value of $m$ such that $\Pi_\mathcal{C}(m) = 2^m$. (A set $S$ of size $m$ such that $\#(\mathcal{C};S) = 2^m$ is said to be shattered by $\mathcal{C}$.) Define the classes
$$\mathcal{M}^+_l := \left\{M^+_{R,\tau,\gamma_l} : \tau\in\mathcal{Y}^n,\ R\subseteq\mathcal{X}\right\},\qquad \mathcal{M}^-_l := \left\{M^-_{R,\tau,\gamma_l} : \tau\in\mathcal{Y}^n,\ R\subseteq\mathcal{X}\right\}.$$
Denote by $\Pi_{\mathcal{M}^+_l}(m)$ the growth function of the class $\mathcal{M}^+_l$. By [28] (see also Theorem 4.3 of [1]), it follows that
$$P^m_{X,Y}(E^+_l) \le 4\,\Pi_{\mathcal{M}^+_l}(2m)\exp\left(-m\,\epsilon^2(\gamma_{l-1})/32\right)$$
and
$$P^m_{X,Y}(E^-_l) \le 4\,\Pi_{\mathcal{M}^-_l}(2m)\exp\left(-m\,\epsilon^2(\gamma_{l-1})/32\right).$$
Define $G(m,\gamma)$ to be an upper bound on $\ln\Pi_{\mathcal{M}^+_l}(2m)$ and $\ln\Pi_{\mathcal{M}^-_l}(2m)$, to be specified later, such that choosing
$$\epsilon(\gamma) := \sqrt{\frac{32}{m}\left(G(m,\gamma) + \ln\frac{8(C+1)}{\delta\gamma}\right)} \quad (3.9)$$
makes the inequality $\epsilon(\gamma)\ge\epsilon(\gamma_{l-1})$ hold for all $\gamma\in\Gamma_l$, as required for Proposition 1 and for the definition of $\epsilon(\gamma)$ in (3.4). Then, substituting $\gamma_{l-1}$ for $\gamma$ in (3.9) and letting (3.9) be the choice for $\epsilon(\gamma_{l-1})$ in (3.7), it follows from Theorems 3.7 and 4.3 of [1] that both $P(E^+_l)$ and $P(E^-_l)$ are bounded from above by $\delta\gamma_{l-1}/(2(C+1))$. From Proposition 1, it follows that (3.6) is bounded as follows:
$$\sum_{l=1}^{L+1} P^m_{X,Y}\left(\bigcup_{\gamma\in\Gamma_l}E(\gamma)\right) \le \sum_{l=1}^{L+1} P^m_{X,Y}(E_l) \quad (3.10)$$
$$\le \sum_{l=1}^{L+1}\left(P^m_{X,Y}(E^+_l) + P^m_{X,Y}(E^-_l)\right) \quad (3.11)$$
$$\le 2\sum_{l=1}^{L+1}\frac{\delta\gamma_{l-1}}{2(C+1)} = \frac{\delta}{C+1}\sum_{l=1}^{L+1}\gamma_{l-1} = \frac{\delta}{C+1}\sum_{l=0}^{L}\gamma_l = \frac{\delta}{C+1}\left(1+\sum_{l=1}^{L}\gamma_l\right) = \delta. \quad (3.12)$$
In the next section, we derive a value of $G(m,\gamma)$ which bounds from above the logarithm of the growth functions of $\mathcal{M}^+_l$ and $\mathcal{M}^-_l$.

4 Bounding the growth function

In this section we bound the growth functions of the classes $\mathcal{M}^+_l$ and $\mathcal{M}^-_l$.

4.1 Half-spaces of $\mathcal{X}$

Define $\mathrm{ind}(p_i)\in[N]$ to be the index of a point $x$ such that $p_i = x_{\mathrm{ind}(p_i)}$; that is, $\mathrm{ind}(p_i)$ is the index of a prototype with respect to the pre-determined ordering of the distance space $\mathcal{X}$. Clearly, each $p_i$ has a unique $\mathrm{ind}(p_i)$ value.
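On a small finite space the dichotomy count $\#(\mathcal{C};S)$ and the growth function $\Pi_\mathcal{C}(m)$ recalled above can be computed by brute force. The sketch below (our illustrative code, on a toy class of initial segments) is exponential in $m$ and only meant to make the definitions concrete.

```python
from itertools import combinations

def num_dichotomies(cls, S):
    """#(C; S): number of distinct intersections S ∩ C over C in the class."""
    return len({frozenset(S) & frozenset(C) for C in cls})

def growth(cls, Z, m):
    """Growth function Pi_C(m): maximum of #(C; S) over all m-subsets S of Z."""
    return max(num_dichotomies(cls, S) for S in combinations(Z, m))

# Toy class on Z = {0,1,2,3}: all initial segments.  Its VC-dimension is 1.
Z = [0, 1, 2, 3]
cls = [set(range(k)) for k in range(5)]  # {}, {0}, {0,1}, {0,1,2}, {0,1,2,3}
print(growth(cls, Z, 1))  # -> 2 = 2^1: every singleton is shattered
print(growth(cls, Z, 2))  # -> 3 < 2^2: no pair is shattered
```

For initial segments, a pair $\{a,b\}$ with $a<b$ only yields the traces $\emptyset$, $\{a\}$ and $\{a,b\}$, which is why the growth function stops doubling at $m = 2$.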
For any $i,j\in[N]$, define an affine half-space set as follows:
$$W^{(i,j)}_\gamma := \{x : d(x,x_j) - d(x,x_i) > \gamma\}. \quad (4.1)$$
Let the class of such sets be
$$W_\gamma := \left\{W^{(i,j)}_\gamma : 1\le i,j\le N,\ j\ne i\right\}.$$

4.2 Matrix representation

Recall that $\mathcal{X} = \{x_1,\dots,x_N\}$. Since a prototype may be any point in $\mathcal{X}$, in general, for any pair of prototypes $p,q\in\mathcal{X}$ there are some $1\le i\ne j\le N$ such that $p = x_i$, $q = x_j$. Write $W^{(p,q)}_0 := W^{(i,j)}_0$. Then $W^{(i,j)}_0$ corresponds to the positive elements of the following vector:
$$F^{(i)}_j := \begin{bmatrix} d(x_1,x_j) - d(x_1,x_i) \\ \vdots \\ d(x_N,x_j) - d(x_N,x_i) \end{bmatrix}. \quad (4.2)$$
Note that taking the sign of a vector $F^{(i)}_j$ yields a partition of $\mathcal{X}$ into two parts, referred to as half-spaces. For a real vector $v\in\mathbb{R}^N$ define $\mathrm{sgn}(v) = [\mathrm{sgn}(v_1),\dots,\mathrm{sgn}(v_N)]$. Hence $\mathrm{sgn}(F^{(i)}_j)$ corresponds to a half-space on $\mathcal{X}$. Fix any point $x_i\in\mathcal{X}$ and let the $N\times(N-1)$ matrix $F^{(i)}$ be defined by
$$F^{(i)} = \left[F^{(i)}_1,\dots,F^{(i)}_{i-1},F^{(i)}_{i+1},\dots,F^{(i)}_N\right] \quad (4.3)$$
where the $j$-th column is $F^{(i)}_j$, $j\ne i$, $1\le j\le N$. Define the $N\times N(N-1)$ matrix
$$F := \left[F^{(1)},\dots,F^{(N)}\right]. \quad (4.4)$$
The binary matrix
$$\mathrm{sgn}(F) := \left[\mathrm{sgn}(F^{(1)}),\dots,\mathrm{sgn}(F^{(N)})\right],\qquad \mathrm{sgn}(F^{(i)}) := \left[\mathrm{sgn}(F^{(i)}_1),\dots,\mathrm{sgn}(F^{(i)}_N)\right],$$
represents the class of all half-spaces on $\mathcal{X}$.

4.3 Thresholding by $\gamma$

The set $W^{(i,j)}_0$ corresponds to some column of the matrix $F$. We now define a more general matrix whose columns correspond to the sets $W^{(i,j)}_\gamma$ defined in (4.1), for any fixed $\gamma>0$. Bounding the VC-dimension of this matrix means that we obtain a bound on the VC
dimension of the class $W_\gamma$. (By the VC-dimension of a binary matrix, we mean that of the set system in which the indicator functions of the sets correspond to the columns of the matrix, with a 1-entry denoting inclusion in the set.) For any $1\le i\ne j\le N$, a set $W^{(i,j)}_\gamma$ corresponds to the positive elements of the vector $F^{(i)}_j - \gamma\mathbf{1}$, where $\mathbf{1}$ is an $N\times 1$ vector of all ones. Denote by $J$ an $N\times N(N-1)$ matrix of all ones. For any $\gamma>0$, let us consider the $N\times N(N-1)$ matrix
$$F_\gamma := F - \gamma J = \left[F^{(1)}_2 - \gamma\mathbf{1},\dots,F^{(N)}_{N-1} - \gamma\mathbf{1}\right]. \quad (4.5)$$
The matrix $F_\gamma$ corresponds to the class $W_\gamma$ of sets, where for column $F^{(i)}_j - \gamma\mathbf{1}$, the positive elements of the vector correspond to the elements of the set $W^{(i,j)}_\gamma$. The binary matrix $\mathrm{sgn}(F_\gamma)$ corresponds to a class of affine half-spaces on $\mathcal{X}$ (the columns of $\mathrm{sgn}(F_\gamma)$).

We now choose the constant $L$ of Section 3 to be the number of distinct positive entries of $F$, denoted $0 < a_L < a_{L-1} < \dots < a_1 \le 1$, where $a_1$ and $a_L$ are the maximum and minimum positive entries of $F$, respectively. For $1\le i\le L$, define the multiplicity of $a_i$, denoted by $m_i$, as the number of times that $a_i$ appears in $F$. We refer to $S_F = \{a_1,a_2,\dots,a_L\}$ as the positive set of $F$ and we set $a_{L+1} = 0$, $a_0 = 1$. We henceforth choose for $\gamma_l$ (defined in Section 3) the value $\gamma_l := a_l$; thus we have $\Gamma_l := (a_l,a_{l-1}]$, $1\le l\le L$, and $\Gamma_{L+1} := [0,a_L]$. From [10] we have the following bound on the VC-dimension of $\mathrm{sgn}(F_\gamma)$:
$$\mathrm{VC}(\mathrm{sgn}(F_\gamma)) \le w(\gamma), \quad (4.6)$$
where
$$w(\gamma) := w_{l-1},\quad \text{for }\gamma\in\Gamma_l,\ 1\le l\le L+1, \quad (4.7)$$
is a non-increasing step function taking the constant value $w_l := \log_2(\Phi(\eta_l)+1)$ over the interval $\Gamma_{l+1}$, and the $\Phi(\eta_l)$ (defined further below in Section 5) are based on the multiplicity values $m_l$ of the positive entries $a_l$. The value of $\eta_0$ is $0$ and $\Phi(0) = 0$, so $w_0 = 0$. Let $T\subseteq\mathcal{X}$ and denote by $(F_\gamma)_T$ the sub-matrix of $F_\gamma$ restricted to the rows that correspond to the elements of $T$.
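The matrix $F$ of (4.2)-(4.4), its sign pattern, and the positive set $S_F$ with multiplicities can all be read off a distance matrix directly. A minimal sketch (our function names; the toy space uses unnormalized integer distances for exact arithmetic):

```python
from collections import Counter

def half_space_matrix(D):
    """Columns F^{(i)}_j of (4.2)-(4.4): for each ordered pair i != j, the
    k-th entry is d(x_k, x_j) - d(x_k, x_i).  D is the N x N distance matrix.
    Returned as a dict keyed by (i, j) for readability."""
    N = len(D)
    return {(i, j): [D[k][j] - D[k][i] for k in range(N)]
            for i in range(N) for j in range(N) if j != i}

def sgn(v):
    """Componentwise sign with sgn(0) = -1, as defined in Section 2.3."""
    return [1 if a > 0 else -1 for a in v]

def positive_set(F):
    """Distinct positive entries a_1 > ... > a_L of F and their multiplicities m_l."""
    counts = Counter(v for col in F.values() for v in col if v > 0)
    a = sorted(counts, reverse=True)
    return a, [counts[x] for x in a]

# Toy space: three points 0, 1, 3 on the line with d(u, v) = |u - v|
# (skipping the diam = 1 normalization so all entries stay integers).
D = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
F = half_space_matrix(D)
print(len(F))           # -> 6 = N(N-1) columns
print(sgn(F[(0, 1)]))   # -> [1, -1, -1], the half-space for the pair (x_1, x_2)
print(positive_set(F))  # -> ([3, 2, 1], [2, 3, 4])
```

Thresholding by $\gamma$ as in (4.5) just subtracts $\gamma$ from every entry before taking signs, so the positive set and the multiplicities are exactly the data needed for the quantities $w_l$ below.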
Sauer's Lemma (see, for instance, Theorem 3.6 in [1]) implies that the number of distinct columns of $\mathrm{sgn}((F_\gamma)_T)$, denoted by $|\mathrm{sgn}((F_\gamma)_T)|$, is bounded as follows:
$$\left|\mathrm{sgn}((F_\gamma)_T)\right| \le \sum_{i=0}^{w(\gamma)}\binom{|T|}{i} \le \left(\frac{e|T|}{w(\gamma)}\right)^{w(\gamma)}.$$
Since $F_\gamma$ corresponds to the class $W_\gamma$, the number of dichotomies of the class of sets $W_\gamma$ on $T$ is bounded as follows:
$$\#(W_\gamma;T) \le \left(\frac{e|T|}{w(\gamma)}\right)^{w(\gamma)}. \quad (4.8)$$
Because $w(\gamma)$ is a step function over the intervals $\Gamma_l$, the right side of (4.8) is also a step function over these intervals and it suffices to derive its values at the interval boundaries $a_l$. Note that $a_l\in\Gamma_{l+1}$, so for $\gamma = a_l$, $1\le l\le L+1$,
$$\#(W_{a_l};T) \le \left(\frac{e|T|}{w_l}\right)^{w_l}. \quad (4.9)$$
For $l = 0$, since $w_0 = 0$, we have $\#(W_{a_0};T) = 1$.

4.4 Bounding the growth functions of $\mathcal{M}^+_{a_l}$ and $\mathcal{M}^-_{a_l}$

We bound the growth function of the class $\mathcal{M}^+_{a_l}$. We first fix the prototype-label vector $\tau$ and let $R$ run over all possible $n$-prototype sets. We denote by $\mathcal{M}^+_{\tau,a_l}$ the corresponding class of sets $M^+_{R,\tau,a_l}$, $R\subseteq\mathcal{X}$.

Proposition 2. For any fixed $\tau\in\mathcal{Y}^n$ and for any subset $S\subseteq\mathcal{X}\times\{1\}$, the number of dichotomies $\#(\mathcal{M}^+_{\tau,a_l};S)$ obtained by $\mathcal{M}^+_{\tau,a_l}$ on $S$ is bounded as follows:
$$\#(\mathcal{M}^+_{\tau,a_l};S) \le \#(W_{a_l};S^+)^{N^+(\tau)N^-(\tau)},$$
where $S^+ := \{x\in\mathcal{X} : (x,1)\in S\}$.

Proof. The number of dichotomies that the class $\mathcal{M}^+_{\tau,a_l}$ of sets $M^+_{R,\tau,a_l}$ obtains on $S$ is the same as the number of dichotomies that the class $\mathcal{V}^+_{\tau,a_l}$ of sets $V^+_{R,\tau,a_l} := \{x : (x,1)\in M^+_{R,\tau,a_l}\}$ obtains on $S^+$. Hence it suffices to find an upper bound on $\#(\mathcal{V}^+_{\tau,a_l};S^+)$. Fix any set of prototypes $R\subseteq\mathcal{X}$ of cardinality $n$. We have
$$V^+_{R,\tau,a_l} = \left\{x : \min_{j\in N^-(\tau)} d(x,p_j) - \min_{i\in N^+(\tau)} d(x,p_i) > a_l\right\}$$
$$= \left\{x : \exists i\in N^+(\tau),\ \forall j\in N^-(\tau),\ d(x,p_j) - d(x,p_i) > a_l\right\}$$
$$= \bigcup_{i\in N^+(\tau)}\ \bigcap_{j\in N^-(\tau)}\ \{x : d(x,p_j) - d(x,p_i) > a_l\}$$
$$= \bigcup_{i\in N^+(\tau)}\ \bigcap_{j\in N^-(\tau)}\ W^{(\mathrm{ind}(p_i),\mathrm{ind}(p_j))}_{a_l}.$$
Define the $N^-(\tau)$-fold intersection set
$$A^{(R)}_i := \bigcap_{j\in N^-(\tau)} W^{(\mathrm{ind}(p_i),\mathrm{ind}(p_j))}_{a_l}$$
and the class of such sets by
$$\mathcal{A}_R := \left\{A^{(R)}_i : i\in N^+(\tau)\right\}.$$
From Theorem 13.5(iii) of [17] it follows that
$$\#(\mathcal{A}_R;S^+) \le \#(W_{a_l};S^+)^{N^-(\tau)}.$$
Now, let
$$B_R := \bigcup_{i\in N^+(\tau)} A^{(R)}_i$$
and denote the class of all such $N^+(\tau)$-fold unions by $\mathcal{B} := \{B_R : R\subseteq\mathcal{X}\}$. From Theorem 13.5(iv) of [17] it follows that
$$\#(\mathcal{B};S^+) \le \#(\mathcal{A}_R;S^+)^{N^+(\tau)}.$$
Hence we have
$$\#(\mathcal{B};S^+) \le \#(W_{a_l};S^+)^{N^-(\tau)N^+(\tau)}.$$
Finally, we have $V^+_{R,\tau,a_l} = B_R$ and $\mathcal{V}^+_{\tau,a_l} = \mathcal{B}$. Combining all of the above we obtain
$$\#(\mathcal{M}^+_{\tau,a_l};S) = \#(\mathcal{V}^+_{\tau,a_l};S^+) = \#(\mathcal{B};S^+) \le \#(W_{a_l};S^+)^{N^-(\tau)N^+(\tau)}.$$

Denote the set of dichotomies of $S$ by all sets $M^+_{R,\tau,a_l}$ as
$$U_{R,\tau,a_l}(S) := \left\{S\cap M^+_{R,\tau,a_l} : M^+_{R,\tau,a_l}\in\mathcal{M}^+_{\tau,a_l}\right\}.$$
Letting $\tau$ be unfixed, the number of dichotomies obtained by $\mathcal{M}^+_{a_l}$ satisfies
$$\left|\left\{U_{R,\tau,a_l}(S) : \tau\in\mathcal{Y}^n,\ R\subseteq\mathcal{X},\ |R|=n\right\}\right| = \left|\bigcup_{\tau\in\mathcal{Y}^n}\left\{U_{R,\tau,a_l}(S) : R\subseteq\mathcal{X},\ |R|=n\right\}\right|$$
$$\le \sum_{\tau\in\mathcal{Y}^n}\left|\left\{U_{R,\tau,a_l}(S) : R\subseteq\mathcal{X},\ |R|=n\right\}\right| \quad (4.10)$$
$$\le \sum_{\tau\in\mathcal{Y}^n}\#(W_{a_l};S^+)^{N^+(\tau)N^-(\tau)} \quad (4.11)$$
$$= \sum_{k=1}^{n-1}\binom{n}{k}\,\#(W_{a_l};S^+)^{k(n-k)}, \quad (4.12)$$
where (4.11) follows from Proposition 2.
Remark 3. The expression $2^n\left(\#(W_{a_l};S^+)\right)^{n(n-1)}$ is a simple yet trivial bound on (4.11).

We now obtain a much tighter bound than the one in Remark 3.

Proposition 4. Define $w$ by $w := \#(W_{a_l};S^+)$. Then the following bound holds:
$$\sum_{k=1}^{n-1}\binom{n}{k}w^{k(n-k)} \le w^{n^2/4}\,n\binom{n}{\lfloor n/2\rfloor}. \quad (4.13)$$
Proof: We have that $k(n-k)$ has a maximum at $k = n/2$ and therefore $k(n-k)\le n^2/4$. Furthermore, for all $k$,
$$\binom{n}{k}\le\binom{n}{\lfloor n/2\rfloor}.$$
Hence
$$\sum_{k=1}^{n-1}\binom{n}{k}w^{k(n-k)} \le n\binom{n}{\lfloor n/2\rfloor}w^{n^2/4}.$$

From (4.12) and Proposition 4, it follows that for any $S\subseteq\mathcal{X}\times\{1\}$ of cardinality $m$,
$$\#(\mathcal{M}^+_{a_l};S) \le w^{n^2/4}\,n\binom{n}{\lfloor n/2\rfloor}. \quad (4.14)$$
We now consider the class $\mathcal{M}^-_{a_l}$.

Proposition 5. For any fixed $\tau\in\mathcal{Y}^n$ and for any subset $S\subseteq\mathcal{X}\times\{-1\}$, the number of dichotomies $\#(\mathcal{M}^-_{\tau,a_l};S)$ obtained by $\mathcal{M}^-_{\tau,a_l}$ on $S$ is bounded from above by the right side of (4.14).

Proof. We follow the proof of Proposition 2, and instead of the class $\mathcal{M}^+_{\tau,a_l}$ and a set $M^+_{R,\tau,a_l}$ we consider the class $\mathcal{M}^-_{\tau,a_l}$ and a set $M^-_{R,\tau,a_l}$. By definition this set equals
$$M^-_{R,\tau,a_l} = \left\{(x,-1) : \min_{j\in N^-(\tau)} d(x,p_j) - \min_{i\in N^+(\tau)} d(x,p_i) < -a_l\right\}$$
and can be written as
$$M^-_{R,\tau,a_l} = \left\{(x,-1) : \min_{i\in N^+(\tau)} d(x,p_i) - \min_{j\in N^-(\tau)} d(x,p_j) > a_l\right\}. \quad (4.15)$$
Thus, as in the proof of Proposition 2, the set $M^-_{R,\tau,a_l}$ corresponds to a set $V^-_{R,\tau,a_l}$ which is defined as
$$V^-_{R,\tau,a_l} = \left\{x : \min_{i\in N^+(\tau)} d(x,p_i) - \min_{j\in N^-(\tau)} d(x,p_j) > a_l\right\}$$
$$= \left\{x : \exists j\in N^-(\tau),\ \forall i\in N^+(\tau),\ d(x,p_i) - d(x,p_j) > a_l\right\}$$
$$= \bigcup_{j\in N^-(\tau)}\ \bigcap_{i\in N^+(\tau)}\ \{x : d(x,p_i) - d(x,p_j) > a_l\}$$
$$= \bigcup_{j\in N^-(\tau)}\ \bigcap_{i\in N^+(\tau)}\ W^{(\mathrm{ind}(p_j),\mathrm{ind}(p_i))}_{a_l}.$$
From here the proof proceeds as the proof of Proposition 2, just swapping the indices $i$ with $j$, the sets $N^-$ with $N^+$, and the sets $S^+$ with $S^-$.

4.5 Finalizing

In the previous section we derived an upper bound on the number of dichotomies on any set of cardinality $m$ obtained by $\mathcal{M}^+_{\tau,a_l}$ or $\mathcal{M}^-_{\tau,a_l}$, in terms of the number of dichotomies by the classes $W_{a_l}$. For $T\subseteq\mathcal{X}$, let $w(T) := \#(W_{a_l};T)$ and define the growth function of $W_{a_l}$ as
$$w_m := \max_{T:\ |T|=m} w(T).$$
From (4.14) and from the fact that $w_m \ge 1$, it follows that the growth function of $\mathcal{M}^+_{\tau,a_l}$ is bounded as follows:
$$\Pi_{\mathcal{M}^+_{a_l}}(m) \le w_m^{n^2/4}\,n\binom{n}{\lfloor n/2\rfloor}.$$
Using a standard bound for the central binomial coefficient, we have
$$\binom{n}{\lfloor n/2\rfloor} < 2^n\sqrt{\frac{2}{\pi n}}$$
and so
$$\Pi_{\mathcal{M}^+_{a_l}}(m) \le w_m^{n^2/4}\,\sqrt{\frac{2n}{\pi}}\,2^n. \quad (4.16)$$
From (4.9) we have
$$w_m \le \left(\frac{em}{w(a_l)}\right)^{w(a_l)}$$
and therefore
$$\ln\Pi_{\mathcal{M}^+_{a_l}}(m) \le \frac{n^2}{4}\ln w_m + n\ln 2 + \frac12\ln\frac{2n}{\pi} \le \frac{n^2\,w(a_l)}{4}\ln\frac{em}{w(a_l)} + n\ln 2 + \frac12\ln\frac{2n}{\pi}. \quad (4.17)$$
From (4.7), we have $w(a_l) = w_l$. Therefore (4.17) equals
$$\frac{n^2 w_l}{4}\ln\frac{em}{w_l} + n\ln 2 + \frac12\ln\frac{2n}{\pi}.$$
We now define $G(m,\gamma)$ of Section 3.2 as follows (recall that $G(m,\gamma)$ bounds the growth function evaluated at $2m$): for any $\gamma\in\Gamma_{l+1}$,
$$G(m,\gamma) := \frac{n^2 w_l}{4}\ln\frac{2em}{w_l} + n\ln 2 + \frac12\ln\frac{2n}{\pi}.$$
Note that for all $\gamma\in\Gamma_l$, $G(m,\gamma) = G(m,a_{l-1})$, and the second term inside the square root in (3.9) satisfies $\ln(8(C+1)/(\delta\gamma)) \ge \ln(8(C+1)/(\delta a_{l-1}))$, so it follows that $\epsilon(\gamma)\ge\epsilon(a_{l-1})$, and therefore the requirement on $\epsilon(\gamma)$ of Section 3.2, namely that it is non-decreasing as $\gamma$ decreases, is satisfied.

5 Main result

To obtain the main result, we draw on some results and notation from [10]. We define a few quantities, leaving the dependence on $N$ implicit. Let us denote by the $i$-th shell of the binary $N$-dimensional cube $\{-1,1\}^N$ (the $N$-cube) the set of all vertices that have $i$ components that are $1$. Denote by
$$c_j := \binom{N}{j}$$
the number of vertices in the $j$-th shell. Denote by
$$b_n := \sum_{j=1}^n c_j \quad (5.1)$$
the number of vertices of the cube contained in the first $n$ shells, and let $b_0 := 0$. Let us define
$$\ell_n := \sum_{j=1}^n j\,c_j, \quad (5.2)$$
the total weight (number of 1-entries) in all of the vertices of the cube that are in the first $n$ shells. For a positive integer $m$ define
$$Q(m) := \min\{q : \ell_q \ge m\}. \quad (5.3)$$
For instance, if $m = 17$ and $N = 4$, then $\ell_2 = 16 < 17 \le \ell_3 = 28$, so $Q(17) = 3$.
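The shell quantities $\ell_q$ and $Q(m)$ of (5.2)-(5.3) are straightforward to compute; a minimal sketch (our function names, assuming $m \le \ell_N$ so the loop terminates) reproduces the worked example from the text.

```python
from math import comb

def ell(q, N):
    """ell_q (5.2): total number of 1-entries over the first q shells of the N-cube."""
    return sum(j * comb(N, j) for j in range(1, q + 1))

def Q(m, N):
    """Q(m) (5.3): the smallest q with ell_q >= m (assumes m <= ell_N)."""
    q = 1
    while ell(q, N) < m:
        q += 1
    return q

# Worked example from the text: N = 4, m = 17.
print(ell(2, 4), ell(3, 4))  # -> 16 28
print(Q(17, 4))              # -> 3
```

The first shell of the 4-cube contributes $1\cdot 4$ ones, the second $2\cdot 6$, and the third $3\cdot 4$, which is where 16 and 28 come from.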
Let $\rho := m - \ell_{Q(m)-1}$, and then define $\lceil m\rceil$ as the following rounded value:
$$\lceil m\rceil := \begin{cases} m & \text{if } Q(m) = 1 \text{ or } \rho\bmod Q(m) = 0 \\ m + \left(Q(m) - \rho\bmod Q(m)\right) & \text{otherwise.}\end{cases}$$
So, $\lceil m\rceil$ is the smallest integer greater than or equal to $m$ which is the total weight (number of 1-entries) in a set of vertices of the cube, where that set is formed by first populating the first shell, then the second, and so on. So, $m$ 1-entries might not be enough to form all the vertices of shells $1$ to $Q(m)-1$ and then some vertices in shell $Q(m)$: an additional (at most $Q(m)-1$) 1-entries might be necessary (to complete a vertex in shell $Q(m)$). For instance (continuing the above example), since $Q(17) = 3$ and $\rho = 1$, we have $\lceil 17\rceil = 17 + (3 - 1\bmod 3) = 19$, and indeed 19 1-entries are required to form the vertices in shells 1 and 2 and the first vertex of the 3rd shell.

For a positive integer $m$ let us define
$$\Phi(m) := b_{Q(m)-1} + \frac{\lceil m\rceil - \ell_{Q(m)-1}}{Q(m)} \quad (5.4)$$
and define $\Phi(0) := 0$. Define the numbers $\eta_i$ as follows (where $m_i$ is the multiplicity of $a_i$, as earlier):
$$\eta_0 = 0,\qquad \eta_i = \lceil\eta_{i-1} + m_i\rceil,\quad 1\le i\le L. \quad (5.5)$$
Note that the $\eta_i$, $1\le i\le L$, depend only on the $m_i$ and hence can be evaluated directly from the matrix $F$.

The following is the main result of the paper.

Theorem 6. Let $N\ge 1$ and $n\ge 2$ be fixed integers and let $\mathcal{X} = \{x_i\}_{i=1}^N$ be a finite distance space with a distance function $d(x_i,x_j)$, normalized such that $\mathrm{diam}(\mathcal{X}) = \max_{1\le i,j\le N} d(x_i,x_j) = 1$. Let
$$F^{(i)}_j := \begin{bmatrix} d(x_1,x_j) - d(x_1,x_i) \\ \vdots \\ d(x_N,x_j) - d(x_N,x_i)\end{bmatrix},$$
define the $N\times(N-1)$ matrix
$$F^{(i)} = \left[F^{(i)}_1,\dots,F^{(i)}_{i-1},F^{(i)}_{i+1},\dots,F^{(i)}_N\right]$$
and let
$$F := \left[F^{(1)},\dots,F^{(N)}\right].$$
Let $1 = a_0 > a_1 > \dots > a_L > a_{L+1} = 0$, where $a_1,\dots,a_L$ are the distinct positive entries of $F$, and let $m_l\ge 1$ be the number of times that $a_l$ appears in $F$, $1\le l\le L$. Define $\Gamma_l := (a_l,a_{l-1}]$, $1\le l\le L$, $\Gamma_{L+1} := [0,a_L]$ and $C := \sum_{l=1}^L a_l$. For $1\le l\le L$, let $w_l = \log_2(\Phi(\eta_l)+1)$.
Let $\mathcal{Y} = \{-1,1\}$ and let $(R,\tau)\in(\mathcal{X}\times\mathcal{Y})^n$, $R := \{p_i\}_{i=1}^n$, denote any set of $n$ prototypes $p_i$ with $N^+(\tau)$ of them labeled $1$ and $N^-(\tau)$ labeled $-1$. Let $h_{R,\tau}$ be a nearest-prototype binary classifier, given by
$$h_{R,\tau}(x) = \begin{cases} 1 & \text{if } \operatorname{argmin}_{1\le i\le n} d(x,p_i)\in N^+(\tau) \\ -1 & \text{otherwise}\end{cases} \quad (5.6)$$
and define its signed width function $f_{R,\tau}$ as
$$f_{R,\tau}(x) = \min_{j\in N^-(\tau)} d(x,p_j) - \min_{i\in N^+(\tau)} d(x,p_i).$$
Let $P^m := P^m_{XY}$ be a probability measure over $(\mathcal{X}\times\mathcal{Y})^m$. For any $0<\delta<1$, with $P^m$-probability at least $1-\delta$ the following holds for an i.i.d. sample $\sigma := \{(X_i,Y_i)\}_{i=1}^m\in(\mathcal{X}\times\mathcal{Y})^m$ drawn according to $P^m$: for any set of $n$ labeled prototypes $(R,\tau)\in(\mathcal{X}\times\mathcal{Y})^n$, where $(R,\tau)$ may depend on the sample (and, in particular, may be a subset of $\sigma$), and for all $\gamma>0$,
$$P(Yf_{R,\tau}(X)\le 0) \le \frac{1}{m}\sum_{j=1}^m I\{Y_j f_{R,\tau}(X_j)\le\gamma\} + \epsilon(m,\gamma,\delta), \quad (5.7)$$
where, for $\gamma\in\Gamma_{l+1}$, $0\le l\le L$,
$$\epsilon(m,\gamma,\delta) := \sqrt{\frac{32}{m}\left(\frac{n^2 w_l}{4}\ln\frac{2em}{w_l} + n\ln 2 + \frac12\ln\frac{2n}{\pi} + \ln\frac{8(C+1)}{\delta\gamma}\right)} \quad (5.8)$$
with $w_l := \log_2(\Phi(\eta_l)+1)$.

The size $N$ of the distance space does not enter the bound of Theorem 6 explicitly, but because the values of $w_l$ and $C$ depend on the distance space through the positive entries of the matrix $F$, these quantities may grow with $N$ (depending on the distance space $\mathcal{X}$). The value of $w_l$ decreases as $l = l(\gamma)$ decreases (because the intervals $\Gamma_l$ are situated more to the right as $l$ decreases). Thus $w_l$ decreases as a step function over the intervals $\Gamma_l$ as $\gamma$ increases: the larger the value of $\gamma$, the lower the interval index $l$ and the lower the value of $w_l$. Thus the upper bound (5.8) decreases as $\gamma$ increases (assuming that the change in $w_l$ dominates the change in the $\ln\frac{1}{\gamma}$ term). To get a feel for the rate of decrease of $w_l$ with respect to $\gamma$, see the example on p. 23 of [10]. Since $\gamma$ is a parameter that may be chosen after the random sample is drawn, it can depend on the sample, which makes the bound data-dependent.
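The combinatorial quantities $\lceil m\rceil$ and $\Phi(m)$ of (5.4) that enter $w_l$ in Theorem 6 can be sketched directly from their definitions; all function names below are ours, and the sketch assumes $m\le\ell_N$, reproducing the running example with $N = 4$, $m = 17$.

```python
from math import comb

def ell(q, N):
    """ell_q (5.2): total weight of the first q shells of the N-cube."""
    return sum(j * comb(N, j) for j in range(1, q + 1))

def b(q, N):
    """b_q (5.1): number of vertices in the first q shells."""
    return sum(comb(N, j) for j in range(1, q + 1))

def Q(m, N):
    """Q(m) (5.3): smallest q with ell_q >= m."""
    q = 1
    while ell(q, N) < m:
        q += 1
    return q

def ceil_weight(m, N):
    """<m>: smallest weight >= m realizable by whole vertices, filling shells in order."""
    q = Q(m, N)
    rho = m - ell(q - 1, N)
    if q == 1 or rho % q == 0:
        return m
    return m + (q - rho % q)

def Phi(m, N):
    """Phi(m) (5.4): number of cube vertices carrying <m> 1-entries."""
    if m == 0:
        return 0
    q = Q(m, N)
    return b(q - 1, N) + (ceil_weight(m, N) - ell(q - 1, N)) // q

print(ceil_weight(17, 4))  # -> 19, as in the worked example
print(Phi(17, 4))          # -> 11: 10 vertices in shells 1-2 plus one in shell 3
```

Feeding the cumulative multiplicities $\eta_l$ of (5.5) into `Phi` then yields $w_l = \log_2(\Phi(\eta_l)+1)$, which is everything (5.8) needs beyond $n$, $m$, $C$, $\delta$ and $\gamma$.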
5.1 Comparison with other results

Theorem 6 states an upper bound on the generalization error for learning on any finite distance space which decreases with respect to m like O(√((n²/m) ln m)). Let us compare this rate to other works. If the distance space is X = R^d and if the prototype set R is a subset of the sample, then Theorem 19.6 of [17] gives the following error bound for learning nearest-neighbor classifiers that are based on R:

ε = O( sqrt( (1/m) [ (d + 1) n(n − 1)/2 · ln m + n + ln(1/δ) ] ) ),

which is also O(√((n²/m) ln m)) but, in contrast to Theorem 6, is not data-dependent, which in general makes it looser.

In [15], Theorem 1 presents an error bound for learning LVQ on R^d which has a dependence on the sample margin of the form O(√(a n² / (γ² m))), where a = min{d + 1, B²/γ²}, B bounds the magnitude of each sample point, and 0 < γ < 1 is the sample margin (which is defined similarly to the width (2.4)). They claim that this bound is independent of the dimension d, presumably because if B²/γ² is smaller than d + 1 then the d + 1 factor disappears from the bound. Their proof is not included; however, it appears to be a direct application of fat-shattering error bounds; see, for instance, Theorem 4.18 of [16], which bounds the generalization error of learning linear classifiers and has the same 1/γ² dependence on the margin. These bounds are based on a bound on the log of the γ-covering number in terms of the fat-shattering dimension d* of linear classifiers, with d* ≤ B²/γ².

In comparison, the bound of Theorem 6 also depends on γ, but indirectly, through w_l (as discussed above, l = l(γ)). Depending on the distance space's set S_F of positive entries, w_l may decrease, as γ grows, even faster than 1/γ².

Although we assume that the distance space is finite of cardinality N, the bound (5.8) may or may not grow with N; this depends on the matrix F. In comparison to the above works, in (5.8) there is an implicit complexity quantity which enters through w_l and is defined as ψ(σ_l).
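The step-function dependence of the bound on γ can be made concrete. The following sketch builds the matrix F from a normalized distance matrix, extracts the distinct positive entries a_1 > ⋯ > a_L, and locates the interval index l with γ ∈ Δ_{l+1}. This is a minimal sketch under illustrative conventions (function names are ours, and the quantities ψ(σ_l) and w_l themselves, which require the shell counts defined earlier, are omitted):

```python
import numpy as np

def positive_levels(D):
    """Distinct positive entries of F, sorted decreasingly: a_1 > ... > a_L.
    D is the N x N distance matrix, assumed normalized to diameter 1."""
    N = D.shape[0]
    cols = []
    for i in range(N):
        for j in range(N):
            if j != i:
                cols.append(D[:, j] - D[:, i])  # column f_j^{(i)} of F
    F = np.stack(cols, axis=1)                  # the N x N(N-1) matrix F
    vals = np.unique(F[F > 0])                  # ascending distinct positives
    return vals[::-1]                           # decreasing order

def interval_index(gamma, a):
    """Return l such that gamma lies in Delta_{l+1} = (a_{l+1}, a_l],
    using the conventions a_0 = 1 and a_{L+1} = 0."""
    levels = [1.0] + [v for v in a if v < 1.0] + [0.0]
    for l in range(len(levels) - 1):
        if levels[l + 1] < gamma <= levels[l]:
            return l
    raise ValueError("gamma must lie in (0, 1]")
```

For instance, for three collinear points at 0, 0.5 and 1 the positive entries of F are {0.5, 1.0}, so γ = 0.7 falls in Δ_1 (index l = 0) while γ = 0.3 falls in Δ_2 (index l = 1); as γ grows, l drops, and with it w_l and the bound (5.8).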
The quantity ψ(σ_l) is an upper bound on the pseudo-rank of the perturbed matrix F − a_l (see [10]), which is the number of distinct columns of the matrix sgn(F − a_l), and it may depend on N (depending on the definition of the distance space).

6 Conclusions

We use the concept of width to learn a family of classifiers based on the nearest-prototype decision rule over an arbitrary finite distance space (a significantly more general and perhaps
more applicable setting than that of metric spaces). In this setting a classifier is represented by a set of n prototypes, which can be any labeled points in the distance space, even a subset of the sample. We obtain an upper bound on the generalization error of any such nearest-prototype classifier. Using γ as the width parameter, the error bound depends on the γ-empirical error and holds uniformly over the family of such classifiers. Ignoring the dependence on γ, the bound is O(√((n²/m) ln m)). The dependence on γ is more subtle because γ enters through a complexity quantity ψ(σ_l), which bounds from above the pseudo-rank of a perturbed version of a matrix that represents all half-spaces in the distance space.

Acknowledgements

This work was supported in part by a research grant from the Suntory and Toyota International Centres for Economics and Related Disciplines at the London School of Economics. The authors thank the referees for their comments, which resulted in shorter proofs of Propositions 2 and 4.

References

[1] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

[2] M. Anthony and P. L. Bartlett. Function learning from interpolation. Combinatorics, Probability and Computing, 9, 2000.

[3] M. Anthony and J. Ratsaby. Classification based on prototypes with spheres of influence. Information and Computation, 256, 2017.

[4] M. Anthony and J. Ratsaby. Maximal width learning of binary functions. Theoretical Computer Science, 411, 2010.

[5] M. Anthony and J. Ratsaby. A hybrid classifier based on boxes and nearest neighbors. Discrete Applied Mathematics, 172:1-11, 2014.

[6] M. Anthony and J. Ratsaby. Analysis of a multi-category classifier. Discrete Applied Mathematics, 160(16-17), 2012.

[7] M. Anthony and J. Ratsaby. Learning bounds via sample width for classifiers on finite metric spaces. Theoretical Computer Science, 529:2-10, 2014.

[8] M. Anthony and J. Ratsaby. Multi-category classifiers and sample width.
Journal of Computer and System Sciences, 82(8), 2016.

[9] M. Anthony and J. Ratsaby. A probabilistic approach to case-based inference. Theoretical Computer Science, 589:61-75, 2015.

[10] M. Anthony and J. Ratsaby. Large-width bounds for learning half-spaces on distance spaces. Submitted, 2017.
[11] A. Belousov and J. Ratsaby. Massively parallel computations of the LZ-complexity of strings. In Proc. of the 28th IEEE Convention of Electrical and Electronics Engineers in Israel (IEEEI'14), pages 1-5, Eilat, December 2014.

[12] A. Belousov and J. Ratsaby. A parallel distributed processing algorithm for image feature extraction. In Advances in Intelligent Data Analysis XIV: 14th International Symposium, IDA 2015, Saint-Etienne, France, October 22-24, 2015, Proceedings, volume 9385 of Lecture Notes in Computer Science. Springer, 2015.

[13] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4), 1989.

[14] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21-27, January 1967.

[15] K. Crammer, R. Gilad-Bachrach, A. Navot, and N. Tishby. Margin analysis of the LVQ algorithm. In Advances in Neural Information Processing Systems 15 (NIPS 2002, December 9-14, 2002, Vancouver, British Columbia, Canada), 2002.

[16] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.

[17] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.

[18] M. Deza and E. Deza. Encyclopedia of Distances, volume 15 of Series in Computer Science. Springer-Verlag, 2009.

[19] R. M. Dudley. Uniform Central Limit Theorems, volume 63 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, UK, 1999.

[20] B. Hammer, M. Strickert, and T. Villmann. On the generalization ability of GRLVQ networks. Neural Processing Letters, 21(2), 2005.

[21] T. Kohonen. Self-Organizing Maps. Springer, 2001.

[22] E. Pekalska and R. P. W. Duin. The Dissimilarity Representation for Pattern Recognition: Foundations and Applications. World Scientific Publishing, Singapore, 2005.

[23] D. Pollard (1984).
Convergence of Stochastic Processes. Springer-Verlag.

[24] P. Schneider, M. Biehl, and B. Hammer. Adaptive relevance matrices in learning vector quantization. Neural Computation, 21(12), 2009.

[25] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1998.

[26] A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans. Advances in Large-Margin Classifiers (Neural Information Processing). MIT Press, 2000.

[27] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[28] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2), 1971.
Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question
More informationUnsupervised Learning with Permuted Data
Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University
More informationBennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence
Bennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence Chao Zhang The Biodesign Institute Arizona State University Tempe, AZ 8587, USA Abstract In this paper, we present
More informationGI01/M055: Supervised Learning
GI01/M055: Supervised Learning 1. Introduction to Supervised Learning October 5, 2009 John Shawe-Taylor 1 Course information 1. When: Mondays, 14:00 17:00 Where: Room 1.20, Engineering Building, Malet
More information