Structural Risk Minimization over Data-Dependent Hierarchies

John Shawe-Taylor, Department of Computer Science, Royal Holloway and Bedford New College, University of London, Egham, TW20 0EX, UK
Peter L. Bartlett, Department of Systems Engineering, Australian National University, Canberra 0200, Australia
Robert C. Williamson, Department of Engineering, Australian National University, Canberra 0200, Australia
Martin Anthony, Department of Mathematics, London School of Economics, Houghton Street, London WC2A 2AE, UK

November 28, 1997

Abstract

The paper introduces some generalizations of Vapnik's method of structural risk minimisation (SRM). As well as making explicit some of the details on SRM, it provides a result that allows one to trade off errors on the training sample against improved generalization performance. It then considers the more general case when the hierarchy of classes is chosen in response to the data. A result is presented on the generalization performance of classifiers with a large margin. This theoretically explains the impressive generalization performance of the maximal margin hyperplane algorithm of Vapnik and co-workers (which is the basis for their support vector machines). The paper concludes with a more general result in terms of luckiness functions, which provides a quite general way for exploiting serendipitous simplicity in observed data to obtain better prediction accuracy from small training sets. Four examples are given of such functions, including the VC dimension measured on the sample.

Keywords: Learning Machines, Maximal Margin, Support Vector Machines, Probably Smooth Luckiness, Uniform Convergence, Vapnik-Chervonenkis Dimension, Fat Shattering Dimension, Computational Learning Theory, Probably Approximately Correct Learning.

1 Introduction

The standard Probably Approximately Correct (PAC) model of learning considers a fixed hypothesis class H together with a required accuracy ε and confidence 1 − δ. The theory characterises when a target function from H can be learned from examples in terms of the Vapnik-Chervonenkis dimension, a measure of the flexibility of the class H, and specifies sample sizes required to deliver the required accuracy with the allowed confidence. In many cases of practical interest the precise class containing the target function to be learned may not be known in advance. The learner may only be given a hierarchy of classes H_1 ⊆ H_2 ⊆ ··· ⊆ H_d ⊆ ··· and be told that the target will lie in one of the sets H_d. Structural Risk Minimization (SRM) copes with this problem by minimizing an upper bound on the expected risk, over each of the hypothesis classes. The principle is a curious one in that in order to have an algorithm it is necessary to have a good theoretical bound on the generalization performance. A formal statement of the method is given in the next section.

Linial, Mansour and Rivest [29] studied learning in a framework as above by allowing the learner to seek a consistent hypothesis in each subclass H_d in turn, drawing enough extra examples at each stage to ensure the correct level of accuracy and confidence should a consistent hypothesis be found. This paper (some of whose results appeared in [43]) addresses two shortcomings of Linial et al.'s approach. The first is the requirement to draw extra examples when seeking in a richer class. It may be unrealistic to assume that examples can be obtained cheaply, and at the same time it would be foolish not to use as many examples as are available from the start. Hence, we suppose that a fixed number of examples is allowed and that the aim of the learner is to bound the expected generalization error with high confidence. The second drawback of the Linial et al. approach is that it is not clear how it can be adapted to handle the case where errors are allowed on the training set. In this situation there is a need to trade off the number of errors with the complexity of the class, since taking a class which is too complex can result in a worse generalization error (with a fixed number of examples) than allowing some extra errors in a more restricted class. The model we consider allows a precise bound on the error arising in different classes and hence a reliable way of applying the structural risk minimisation principle introduced by Vapnik [48, 50]. Indeed, the results reported in Sections 2 and 3 of this paper are implicit in the cited references, but our treatment serves to introduce the main results of the paper in later sections, and we make explicit some of the assumptions implicit in the presentations in [48, 50]. A more recent paper by Lugosi and Zeger [38] considers standard SRM and provides bounds for the true error of the hypothesis with lowest empirical error in each class. Whereas our Theorem 2.3 gives an error bound that decreases to twice the empirical error roughly linearly with the ratio of the VC dimension to the number of examples, they give an error bound that decreases to the empirical error itself, but as the square root of this ratio.

From Section 3 onwards we address a shortcoming of the SRM method which Vapnik [48, page 161] highlights: "according to the SRM principle the structure has to be defined a priori before the training data appear." An algorithm using maximally separating hyperplanes proposed by Vapnik [46] and co-workers [14, 16] violates this principle in that the hierarchy defined depends on the data. In Section 3 we prove a result which shows that if one achieves correct classification of some training data with a class of {0,1}-valued functions obtained by thresholding real-valued functions, and if the values of the real-valued functions on the training points are all well away from zero, then there is a bound on the generalization error which can be much better than the one obtained from the VC dimension of the thresholded class. In Section 4 we apply this to the case considered by Vapnik: separating hyperplanes with a large margin. In Section 5 we introduce a more general framework which allows a rather large class of methods of measuring the luckiness of a sample, in the sense that the large margin is lucky. In Section 6 we explicitly show how Vapnik's maximum margin hyperplanes fit into this general framework, which then also allows the radius of the set of points to be estimated from the data. In addition, we show that the function which measures the VC dimension of the set of hypotheses on the sample points is a valid (un)luckiness function. This leads to a bound on the generalization performance in terms of this measured dimension rather than the worst case bound which involves the VC dimension of the set of hypotheses over the whole input space.

Our approach can be interpreted as a general way of encoding our bias, or prior assumptions, and possibly taking advantage of them if they happen to be correct. In the case of the fixed hierarchy, we expect the target (or a close approximation to it) to be in a class H_d with small d. In the maximal separation case, we expect the target to be consistent with some classifying hyperplane that has a large separation from the examples. This corresponds to a collusion between the probability distribution and the target concept, which would be impossible to exploit in the standard PAC distribution independent framework. If these assumptions happen to be correct for the training data, we can be confident we have an accurate hypothesis from a small data set (at the expense of some small penalty if they are incorrect).

A commonly studied related problem is that of model order selection (see for example [34]), and we here briefly make some remarks on the relationship with the work presented in this paper. Assuming the above hierarchy of hypothesis classes, the aim there is to identify the best class index. Often "best" in this literature simply means "correct" in the sense that if in fact the target hypothesis h ∈ H_i, then as the sample size grows to infinity, the selection procedure will (in some probabilistic sense) pick i. Other methods of complexity regularization can be seen to also solve similar problems (see for example [20, 6, 7, 8]). We are not aware of any methods (apart from SRM) for which explicit finite sample size bounds on their performance are available. Furthermore, with the exception of the methods discussed in [8], all such methods take the form of minimizing a cost function comprising an empirical risk plus an additive complexity term which does not depend on the data.
We denote logarithms to base 2 by log, and natural logarithms by ln. If S is a set, |S| denotes its cardinality. We do not explicitly state the measurability conditions needed for our arguments to hold. We assume with no further discussion permissibility of the function classes involved (see Appendix C of [41] and Section 2.3 of [45]).

2 Standard Structural Risk Minimisation

As an initial example we consider a hierarchy of classes H_1 ⊆ H_2 ⊆ ··· ⊆ H_d ⊆ ···, where H_i ⊆ {0,1}^X for some input space X, and where we will assume VCdim(H_d) = d for the rest of this section. (Recall that the VC dimension of a class of {0,1}-valued functions is the size of the largest subset of their domain for which the restriction of the class to that subset is the set of all {0,1}-valued functions; see [49].) Such a hierarchy of classes is called a decomposable concept class by Linial et al. [29]. Related work is presented by Benedek and Itai [12].

We will assume that a fixed number m of labelled examples are given as a vector z = (x, t(x)) to the learner, where x = (x_1, ..., x_m) and t(x) = (t(x_1), ..., t(x_m)), and that the target function t lies in one of the subclasses H_d. The learner uses an algorithm to find a value of d such that H_d contains an hypothesis h that is consistent with the sample z. What we require is a function ε(m, d, δ) which will give the learner an upper bound on the generalization error of h with confidence 1 − δ. The following theorem gives a suitable function. We use Er_z(h) = |{i : h(x_i) ≠ t(x_i)}| to denote the number of errors that h makes on z, and er_P(h) = P{x : h(x) ≠ t(x)} to denote the expected error when x_1, ..., x_m are drawn independently according to P. In what follows we will often write Er_x(h) (rather than Er_z(h)) when the target t is obvious from the context. The following theorem, which appears in [43], covers the case where there are no errors on the training set. It is a well-known result which we quote for completeness.

Theorem 2.1 [43] Let H_i, i = 1, 2, ..., be a sequence of hypothesis classes mapping X to {0,1} such that VCdim(H_i) = i, and let P be a probability distribution on X. Let p_d be any set of positive numbers satisfying Σ_{d=1}^∞ p_d = 1. With probability 1 − δ over m independent examples drawn according to P, for any d for which a learner finds a consistent hypothesis h in H_d, the generalization error of h is bounded from above by

ε(m, d, δ) = (4/m)(d ln(2em/d) + ln(1/p_d) + ln(4/δ)),

provided m ≥ d.

The role of the numbers p_d may seem a little counter-intuitive, as we appear to be able to bias our estimate by adjusting these parameters. The numbers must, however, be specified in advance and represent some apportionment of our confidence to the different points where failure might occur. In this sense they should be one of the arguments of the function ε(m, d, δ). We have deliberately omitted this dependence as they have a different status in the learning framework. It is helpful to think of p_d as our prior estimate of the probability that the smallest class containing a consistent hypothesis is H_d. In particular we can set p_d = 0 for d > m, since we would expect to be able to find a consistent hypothesis in H_m, and if we fail, the bound will not be useful for such large d in any case.

We also wish to consider the possibility of errors on the training sample. The result presented here is analogous to those obtained by Lugosi and Zeger [37] in the statistical framework. We will make use of the following result of Vapnik in a slightly improved version due to Anthony and Shawe-Taylor [4]. Note also that the result is expressed in terms of the quantity Er_z(h), which denotes the number of errors of the hypothesis h on the sample z, rather than the usual proportion of errors.

Theorem 2.2 ([4]) Let 0 < α < 1 and 0 < ε ≤ 1. Suppose H is an hypothesis space of functions from an input space X to {0,1}, and let ν be any probability measure on S = X × {0,1}. Then the probability (with respect to ν^m) that for z ∈ S^m there is some h ∈ H such that er_ν(h) > ε and Er_z(h) ≤ m(1 − α) er_ν(h) is at most

4 B_H(2m) exp(−α²εm/4).

Our aim will be to use a double stratification of δ: as before by class (via p_d), and also by the number of errors on the sample (via q_{dk}). The generalization error will be given as a function of the size of the sample m, the index d of the class, the number of errors on the sample k, and the confidence δ.

Theorem 2.3 Let H_i, i = 1, 2, ..., be a sequence of hypothesis classes mapping X to {0,1} and having VC dimension i. Let ν be any probability measure on S = X × {0,1}, and let p_d, q_{dk} be any sets of positive numbers satisfying Σ_{d=1}^∞ p_d = 1 and Σ_{k=0}^m q_{dk} = 1 for all d. Then with probability 1 − δ over m independent identically distributed examples x, if the learner finds an hypothesis h in H_d with Er_x(h) = k, then the generalization error of h is bounded from above by

ε(m, k, d, δ) = (1/m)(2k + 4d ln(2em/d) + 4 ln(4/(δ p_d q_{dk}))),

provided m ≥ d.

Proof: We bound the required probability of failure

ν^m{z : ∃d ∃k ∃h ∈ H_d, Er_z(h) = k, er_ν(h) > ε(m, k, d, δ)} < δ

by showing that for all d and k,

ν^m{z : ∃h ∈ H_d, Er_z(h) = k, er_ν(h) > ε(m, k, d, δ)} < δ p_d q_{dk}.

We will apply Theorem 2.2 once for each value of k and d. We must therefore ensure that only one value of α = α_{dk} is used in each case. An appropriate value is

α_{dk} = 1 − k/(m ε(m, k, d, δ)).

This ensures that if er_ν(h) > ε(m, k, d, δ) and Er_z(h) = k, then

Er_z(h) = k = m(1 − α_{dk}) ε(m, k, d, δ) ≤ m(1 − α_{dk}) er_ν(h),

as required for an application of the theorem. Hence, if m ≥ d, Sauer's lemma implies

ν^m{z : ∃h ∈ H_d, Er_z(h) = k, er_ν(h) > ε(m, k, d, δ)} ≤ 4 (2em/d)^d exp(−α_{dk}² ε(m, k, d, δ) m/4),

and the right hand side is less than δ p_d q_{dk} provided

α_{dk}² ε(m, k, d, δ) m/4 ≥ d ln(2em/d) + ln(4/(δ p_d q_{dk})),

which, ignoring one term of k²/(4m ε(m, k, d, δ)), holds for

ε(m, k, d, δ) = (1/m)(2k + 4d ln(2em/d) + 4 ln(4/(δ p_d q_{dk}))).

The result follows.

The choice of the prior q_{dk} for different k will again affect the resulting trade-off between complexity and accuracy. In view of our expectation that the penalty term for choosing a large class is probably an overestimate, it seems reasonable to give a correspondingly large penalty for a large number of errors. One possibility is an exponentially decreasing prior distribution such as q_{dk} = 2^{−(k+1)}, though the rate of decrease could also be varied between classes. Assuming the above choice, observe that an incremental search for the optimal value of d would stop when the reduction in the number of classification errors in the next class was less than

0.84 ln(2em/d).

Note that the trade-off between errors on the sample and generalization error is also discussed in [16].
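The bounds of Theorems 2.1 and 2.3, together with the stopping rule just described, are easy to evaluate numerically. The following Python sketch is purely illustrative: the priors p_d = 2^(-d) and q_dk = 2^(-(k+1)) are example choices, not prescribed by the theorems.

```python
from math import e, log

def srm_bound_consistent(m, d, delta, p_d):
    """Theorem 2.1: bound on the generalization error of a consistent hypothesis in H_d."""
    assert m >= d
    return (4.0 / m) * (d * log(2 * e * m / d) + log(1.0 / p_d) + log(4.0 / delta))

def srm_bound_with_errors(m, k, d, delta, p_d, q_dk):
    """Theorem 2.3: bound when the chosen hypothesis in H_d makes k errors on the sample."""
    assert m >= d
    return (1.0 / m) * (2 * k + 4 * d * log(2 * e * m / d)
                        + 4 * log(4.0 / (delta * p_d * q_dk)))

if __name__ == "__main__":
    m, delta = 10000, 0.01
    for d, k in [(10, 0), (10, 5), (50, 0), (50, 5)]:
        p_d, q_dk = 2.0 ** (-d), 2.0 ** (-(k + 1))   # illustrative priors
        print(d, k, round(srm_bound_with_errors(m, k, d, delta, p_d, q_dk), 4))
    # With q_dk = 2^{-(k+1)}, an incremental search over d stops once moving to the
    # next class saves fewer than roughly 0.84 ln(2em/d) training errors.
    d = 10
    print("stopping threshold:", round(0.84 * log(2 * e * m / d), 2))
```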

3 Classifiers with a Large Margin

The standard methods of structural risk minimization require that the decomposition of the hypothesis class be chosen in advance of seeing the data. In this section we introduce our first variant of SRM which effectively makes a decomposition after the data has been seen. The main tool we use is the fat-shattering dimension, which was introduced in [26], and has been used for several problems in learning since [1, 11, 2, 10]. We show that if a classifier correctly classifies a training set with a large margin, and if its fat-shattering function at a scale related to this margin is small, then the generalization error will be small. (This is formally stated in Theorem 3.9 below.)

Definition 3.1 Let F be a set of real valued functions. We say that a set of points X is γ-shattered by F if there are real numbers r_x indexed by x ∈ X such that for all binary vectors b indexed by X, there is a function f_b ∈ F satisfying

f_b(x) ≥ r_x + γ if b_x = 1, and f_b(x) ≤ r_x − γ otherwise.

The fat shattering dimension fat_F of the set F is a function from the positive real numbers to the integers which maps a value γ to the size of the largest γ-shattered set, if this is finite, or infinity otherwise.

Let T_θ denote the threshold function at θ, T_θ : R → {0,1}, T_θ(α) = 1 iff α > θ. Fix a class of [0,1]-valued functions. We can interpret each function f in the class as a classification function by considering the thresholded version, T_{1/2}(f). The following result implies that, if a real-valued function in the class maps all training examples to the correct side of 1/2 by a large margin, the misclassification probability of the thresholded version of the function depends on the fat-shattering dimension of the class, at a scale related to the margin. This result is a special case of Corollary 6 in [2], which applied more generally to arbitrary real-valued target functions. (This application to classification problems was not described in [2].)

Theorem 3.2 Let H be a set of [0,1]-valued functions defined on a set X. Let 0 < γ < 1/2. There is a positive constant K such that, for any function t : X → {0,1} and any probability distribution P on X, with probability at least 1 − δ over a sequence x_1, ..., x_m of examples chosen independently according to P, every h in H that has |h(x_i) − t(x_i)| < 1/2 − γ for i = 1, ..., m satisfies

Pr(|h(x) − t(x)| ≥ 1/2) < ε,

provided that

m ≥ (K/ε)(d log²(1/(γε)) + log(1/δ)),

where d = fat_H(γ/8).

Clearly, this implies that the misclassification probability is less than ε under the conditions of the theorem, since T_{1/2}(h(x)) ≠ t(x) implies |h(x) − t(x)| ≥ 1/2. In the remainder of this section, we present an improvement of this result. By taking advantage of the fact that the target values fall in the finite set {0,1}, and the fact that only the behaviour near the threshold of functions in H is important, we can remove the 1/(γε) factor from the log² factor in the bound. We also improve the constants that would be obtained from the argument used in the proof of Theorem 3.2.

Before we can quote the next lemma, we need another definition.

Definition 3.3 Let (X, d) be a (pseudo-)metric space, let A be a subset of X and ε > 0. A set B ⊆ X is an ε-cover for A if, for every a ∈ A, there exists b ∈ B such that d(a, b) < ε. The ε-covering number of A, N_d(ε, A), is the minimal cardinality of an ε-cover for A (if there is no such finite cover then it is defined to be ∞).
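Returning for a moment to Definition 3.1, the γ-shattering condition can be checked by brute force for a small finite class: pick one function per dichotomy and verify that, at every point, the values used when the bit is 1 and when the bit is 0 leave a gap of at least 2γ (a witness r_x can then sit in the middle of that gap). The affine class below is an illustrative assumption, not an example from the paper.

```python
from itertools import product

def gamma_shattered(points, functions, gamma):
    """Brute-force check of Definition 3.1 for a finite class of real-valued functions.

    For each assignment of one function per dichotomy, verify that at every point the
    smallest value used when the bit is 1 exceeds the largest value used when the bit
    is 0 by at least 2 * gamma; a witness r_x can then be placed inside that gap."""
    n = len(points)
    dichotomies = list(product((0, 1), repeat=n))
    for choice in product(range(len(functions)), repeat=len(dichotomies)):
        ok = True
        for i, x in enumerate(points):
            ones = [functions[choice[j]](x) for j, b in enumerate(dichotomies) if b[i] == 1]
            zeros = [functions[choice[j]](x) for j, b in enumerate(dichotomies) if b[i] == 0]
            if min(ones) - max(zeros) < 2 * gamma:
                ok = False
                break
        if ok:
            return True
    return False

# Illustrative assumption: a small grid of affine functions on the real line.
funcs = [lambda x, a=a, c=c: a * x + c for a in (-1.0, 0.0, 1.0) for c in (-1.0, 0.0, 1.0)]
print(gamma_shattered([0.0, 1.0], funcs, 0.4))   # True: the two points are 0.4-shattered
```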

The idea is that B should be finite but approximate all of A with respect to the pseudometric. As in [2], we will use the l∞ distance over a finite sample x = (x_1, ..., x_m) for the pseudometric in the space of functions,

d_x(f, g) = max_i |f(x_i) − g(x_i)|.

We write N(ε, F, x) for the ε-covering number of F with respect to the pseudo-metric d_x. We now quote a lemma from Alon et al. [1] which we will use below.

Lemma 3.4 (Alon et al. [1]) Let F be a class of functions X → [0,1] and P a distribution over X. Choose 0 < ε < 1 and let d = fat_F(ε/4). Then

E(N(ε, F, x)) ≤ 2 (4m/ε²)^{d log(2em/(dε))},

where the expectation E is taken w.r.t. a sample x ∈ X^m drawn according to P^m.

Corollary 3.5 Let F be a class of functions X → [a, b] and P a distribution over X. Choose 0 < ε < 1 and let d = fat_F(ε/4). Then

E(N(ε, F, x)) ≤ 2 (4m(b − a)²/ε²)^{d log(2em(b − a)/(dε))},

where the expectation E is over samples x ∈ X^m drawn according to P^m.

Proof: We first scale all the functions in F by the affine transformation mapping the interval [a, b] to [0, 1] to create the set of functions F'. Clearly, fat_{F'}(γ) = fat_F(γ(b − a)), while E(N(ε, F, x)) = E(N(ε/(b − a), F', x)). The result follows.

In order to motivate the next lemma we first introduce some notation we will use when we come to apply it. The aim is to transform the problem of observing a large margin into one of observing the maximal value taken by a set of functions. We do this by folding over the functions at the threshold. The following hat operator implements the folding. We define the mapping ^ : R^X → R^{X×{0,1}} by

^ : f ↦ f̂, f̂(x, c) = f(x)(1 − c) + (2θ − f(x))c,

for some fixed real θ. For a set of functions F, we define F̂ = F̂_θ = {f̂ : f ∈ F}. The idea behind this mapping is that for a function f, the corresponding f̂ maps the input x and its classification c to an output value, which will be less than θ provided the classification obtained by thresholding f(x) at θ is correct.
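The pseudometric d_x, the covering numbers it defines, and the folding operator just introduced are all simple to realise concretely. The sketch below builds a (not necessarily minimal) ε-cover greedily, so its size only upper-bounds N(ε, F, x); the sigmoid family is an illustrative assumption.

```python
import math

def d_x(f, g, sample):
    """The pseudometric of the text: l_infinity distance between f and g over the sample."""
    return max(abs(f(xi) - g(xi)) for xi in sample)

def greedy_cover(functions, sample, eps):
    """Greedily construct an eps-cover of `functions` w.r.t. d_x (Definition 3.3).

    Every function ends up within eps of some element of the returned cover, so its
    size is an upper bound on N(eps, F, x), though not necessarily the minimum."""
    cover = []
    for f in functions:
        if all(d_x(f, g, sample) >= eps for g in cover):
            cover.append(f)
    return cover

def fold(f, theta):
    """The hat operator: f_hat(x, c) = f(x)(1 - c) + (2*theta - f(x))*c, which stays
    below theta when thresholding f at theta classifies the labelled point (x, c) correctly."""
    return lambda x, c: f(x) * (1 - c) + (2 * theta - f(x)) * c

sample = [-1.0, -0.5, 0.0, 0.5, 1.0]
funcs = [lambda x, w=w: 1.0 / (1.0 + math.exp(-w * x)) for w in [i / 4.0 for i in range(-8, 9)]]
print("greedy cover size at eps = 0.1:", len(greedy_cover(funcs, sample, 0.1)))
```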

Lemma 3.6 Suppose F is a set of functions that map from X to R with finite fat-shattering dimension bounded by the function afat : R → N which is continuous from the right. Then for any distribution P on X, any k ∈ N and any θ ∈ R,

P^{2m}{xy : ∃f ∈ F, r = max_j{f(x_j)}, 2γ = θ − r, afat(γ/4) = k, |{i : f(y_i) ≥ r + 2γ}| > m ε(m, k, δ)} < δ,

where ε(m, k, δ) = (1/m)(k log(8em/k) log(32m) + log(2/δ)).

Proof: Using the standard permutation argument (as in [49]), we may fix a sequence xy and bound the probability under the uniform distribution on swapping permutations that the permuted sequence satisfies the condition stated. Let γ_k = min{γ' : afat(γ'/4) ≤ k}. Notice that the minimum is defined since afat is continuous from the right, and also that afat(γ_k/4) = afat(γ/4). For any γ satisfying afat(γ/4) = k, we have γ ≥ γ_k, so the probability above is no greater than

P^{2m}{xy : ∃γ ∈ R⁺, afat(γ/4) = k, ∃f ∈ F, A_f(2γ_k)},

where A_f(η) is the event that f(y_i) ≥ max_j{f(x_j)} + η for at least m ε(m, k, δ) points y_i in y. Note that r + 2γ = θ. Let

π(α) := θ if α > θ; θ − 2γ_k if α < θ − 2γ_k; α otherwise,

and let π(F) = {π ∘ f : f ∈ F}. Consider a minimal γ_k-cover B_xy of π(F) in the pseudometric d_xy. We have that for any f ∈ F, there exists f̃ ∈ B_xy with |π(f)(x) − f̃(x)| < γ_k for all x ∈ xy. Thus, since for all x ∈ x, by the definition of r, f(x) ≤ r = θ − 2γ, we have π(f)(x) ≤ θ − 2γ_k = r + 2(γ − γ_k), and so f̃(x) < r + 2γ − γ_k. However, there are at least m ε(m, k, δ) points y ∈ y such that f(y) ≥ r + 2γ, so f̃(y) > r + 2γ − γ_k > max_j{f̃(x_j)}. Since π only reduces separation between output values, we conclude that the event A_{f̃}(0) occurs. By the permutation argument, for fixed f̃, at most 2^{−ε(m,k,δ)m} of the sequences obtained by swapping corresponding points satisfy the conditions, since the ε(m, k, δ)m points with the largest f̃ values must remain on the right hand side for A_{f̃}(0) to occur. Thus by the union bound,

P^{2m}{xy : ∃γ ∈ R⁺, afat(γ/4) = k, ∃f ∈ F, A_f(2γ_k)} ≤ E(|B_xy|) 2^{−ε(m,k,δ)m},

where the expectation is over xy drawn according to P^{2m}. Now for all η > 0, fat_{π(F)}(η) ≤ fat_F(η), since every set of points η-shattered by π(F) can be η-shattered by F. Furthermore, π(F) is a class of functions mapping the set X to the interval [θ − 2γ_k, θ]. Hence, by Corollary 3.5 (setting [a, b] to [θ − 2γ_k, θ], ε to γ_k, and m to 2m),

E(|B_xy|) = E(N(γ_k, π(F), xy)) ≤ 2 (8m(2γ_k)²/γ_k²)^{d log(4em(2γ_k)/(d γ_k))} = 2 (32m)^{d log(8em/d)},

where d = fat_{π(F)}(γ_k/4) ≤ fat_F(γ_k/4) ≤ k. Thus

E(|B_xy|) ≤ 2 (32m)^{k log(8em/k)},

and so E(|B_xy|) 2^{−ε(m,k,δ)m} < δ provided

ε(m, k, δ) ≥ (1/m)(k log(8em/k) log(32m) + log(2/δ)),

as required.
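The counting step at the heart of this proof — once f̃ is fixed, the ε(m, k, δ)m points with the largest f̃ values must all stay in the second half of the double sample, which happens for at most a 2^(−ε(m,k,δ)m) fraction of the coordinate swaps — can be checked empirically. The following Monte Carlo sketch is a simplified illustration under an assumed configuration of values, not part of the proof.

```python
import random

def swap_survival_rate(values_x, values_y, n_big, trials=20000, seed=0):
    """Estimate the fraction of swapping permutations for which the n_big largest values
    of the double sample all land on the y side (the event needed for A_f~(0))."""
    rng = random.Random(seed)
    threshold = sorted(values_x + values_y, reverse=True)[n_big - 1]
    hits = 0
    for _ in range(trials):
        # a swapping permutation independently exchanges each pair (x_i, y_i)
        y_side = [(b if rng.random() < 0.5 else a) for a, b in zip(values_x, values_y)]
        if sum(v >= threshold for v in y_side) >= n_big:
            hits += 1
    return hits / trials

# Assumed configuration: the n_big largest values sit in distinct pairs on the y side,
# so each must independently survive its swap and the rate should be close to 2**(-n_big).
m, n_big = 50, 6
values_x = [0.0] * m
values_y = [10.0] * n_big + [0.0] * (m - n_big)
print(swap_survival_rate(values_x, values_y, n_big), "vs", 2.0 ** (-n_big))
```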

The function afat(γ) is used in this theorem rather than fat_F(γ) since we use the continuity property to ensure that afat(γ_k/4) = k for every k, while we cannot assume that fat_F(γ) is continuous from the right. We could avoid this requirement and give an error estimate directly in terms of fat_F instead of afat, but this would introduce a worse constant in the argument of fat_F. Since in practice one works with continuous upper bounds on fat_F(γ) (e.g. c/γ²) by taking the floor of the value, the critical question becomes whether the bound is strict rather than less than or equal. Provided fat_F is strictly less than the continuous bound, the corresponding floor function is continuous from the right. If not, addition of an arbitrarily small constant to the continuous function will allow substitution of a strict inequality.

Lemma 3.7 Let F be a set of real valued functions from X to R. Then for all γ ≥ 0,

fat_{F̂}(γ) = fat_F(γ).

Proof: For any c ∈ {0,1}^m, we have that f_b realises dichotomy b on x = (x_1, ..., x_m) with margin γ about output values r_i if and only if f̂_b realises dichotomy b ⊕ c on x̂ = ((x_1, c_1), ..., (x_m, c_m)) with margin γ about output values

r̂_i = r_i(1 − c_i) + (2θ − r_i)c_i.

We will make use of the following lemma, which in the form below is due to Vapnik [46, page 168].

Lemma 3.8 Let X be a set and S a system of sets on X, and P a probability measure on X. For x ∈ X^m and A ∈ S, define ν_x(A) := |x ∩ A|/m. If m > 2/ε, then

P^m{x : sup_{A∈S} |ν_x(A) − P(A)| > ε} ≤ 2 P^{2m}{xy : sup_{A∈S} |ν_x(A) − ν_y(A)| > ε/2}.

Let T_θ denote the threshold function at θ: T_θ : R → {0,1}, T_θ(α) = 1 iff α > θ. For a class of functions F, T_θ(F) = {T_θ(f) : f ∈ F}.

Theorem 3.9 Consider a real valued function class F having fat shattering function bounded above by the function afat : R → N which is continuous from the right. Fix θ ∈ R. If a learner correctly classifies m independently generated examples z with h = T_θ(f) ∈ T_θ(F) such that er_z(h) = 0 and γ = min_i |f(x_i) − θ|, then with confidence 1 − δ the expected error of h is bounded from above by

ε(m, k, δ) = (2/m)(k log(8em/k) log(32m) + log(8m/δ)),

where k = afat(γ/8).

Proof: The proof will make use of Lemma 3.8. First we will move to a double sample and stratify by k. By the union bound, it thus suffices to show that

P^{2m}(⋃_{k=1}^{2m} J_k) ≤ Σ_{k=1}^{2m} P^{2m}(J_k) < δ/2,

where

J_k = {xy : ∃h = T_θ(f) ∈ T_θ(F), Er_x(h) = 0, k = afat(γ/8), γ = min_i |f(x_i) − θ|, Er_y(h) ≥ m ε(m, k, δ)/2}.

(The largest value of k we need consider is 2m, since we cannot shatter a greater number of points from xy.) It is sufficient if

P^{2m}(J_k) ≤ δ/(4m) = δ'.

Consider F̂ = F̂_θ and note that by Lemma 3.7 the function afat(γ) also bounds fat_{F̂}(γ). The probability distribution on X̂ = X × {0,1} is given by P on X with the second component determined by the target value of the first component. Note that for a point ŷ ∈ ŷ to be misclassified, it must have f̂(ŷ) ≥ θ ≥ max{f̂(x̂) : x̂ ∈ x̂} + γ, so that

J_k ⊆ {x̂ŷ ∈ (X × {0,1})^{2m} : ∃f̂ ∈ F̂, r = max{f̂(x̂) : x̂ ∈ x̂}, γ = θ − r, k = afat(γ/8), |{ŷ ∈ ŷ : f̂(ŷ) ≥ θ}| ≥ m ε(m, k, δ)/2}.

Replacing γ by γ/2 in Lemma 3.6 we obtain P^{2m}(J_k) ≤ δ' for

ε(m, k, δ) = (2/m)(k log(8em/k) log(32m) + log(2/δ')).

With this linking of ε and m, the condition of Lemma 3.8 is satisfied. Appealing to this and noting that the union bound gives P^{2m}(⋃_{k=1}^{2m} J_k) ≤ Σ_{k=1}^{2m} P^{2m}(J_k), we conclude the proof by substituting for δ'.

A related result, that gives bounds on the misclassification probability of thresholded functions in terms of an error estimate involving the margin of the corresponding real-valued functions, is given in [9]. Using this result and bounds on the fat-shattering dimension of sigmoidal neural networks, that paper also gives bounds on the generalization performance of these networks that depend on the size of the parameters but are independent of the number of parameters.
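The bound of Theorem 3.9 is simple to evaluate numerically; the short sketch below does so with logs taken to base 2, per the convention fixed in the introduction, and with arbitrary sample values.

```python
from math import e, log2

def fat_shattering_bound(m, k, delta):
    """Theorem 3.9: confidence 1 - delta bound on the error of a classifier with zero
    training error at margin gamma, where k = afat(gamma / 8)."""
    return (2.0 / m) * (k * log2(8 * e * m / k) * log2(32 * m) + log2(8 * m / delta))

print(fat_shattering_bound(m=100000, k=20, delta=0.01))
```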

4 Large Margin Hyperplanes

We will now consider a particular case of the results in the previous section, applicable to the class of linear threshold functions in Euclidean space. Vapnik and others [46, 48, 14, 16], [18, page 140] have suggested that choosing the maximal margin hyperplane (i.e. the hyperplane which maximises the minimal distance of the points, assuming a correct classification can be made) will improve the generalization of the resulting classifier. They give evidence to indicate that the generalization performance is frequently significantly better than that predicted by the VC dimension of the full class of linear threshold functions. In this section of the paper we will show that indeed a large margin does help in this case, and we will give an explicit bound on the generalization error in terms of the margin achieved on the training sample. We do this by first bounding the appropriate fat-shattering function, and then applying Theorem 3.9. The margin also arises in the proof of the perceptron convergence theorem (see for example [23, pages 61-62]), where an alternate motivation is given for a large margin: noise immunity. The margin occurs even more explicitly in the Winnow algorithms and their variants developed by Littlestone and others [30, 31, 32]. The connection between these two uses has not yet been explored.

Consider a hyperplane defined by (w, θ), where w is a weight vector and θ a threshold value. Let X_0 be a subset of the Euclidean space that does not have a limit point on the hyperplane, so that

min_{x∈X_0} |⟨x, w⟩ + θ| > 0.

We say that the hyperplane is in canonical form with respect to X_0 if

min_{x∈X_0} |⟨x, w⟩ + θ| = 1.

Let ‖·‖ denote the Euclidean norm. The maximal margin hyperplane is obtained by minimising ‖w‖ subject to these constraints. The points in X_0 for which the minimum is attained are called the support vectors of the maximal margin hyperplane. The following theorem is the basis for our argument for the maximal margin analysis.

Theorem 4.1 (Vapnik [48]) Suppose X_0 is a subset of the input space contained in a ball of radius R about some point. Consider the set of hyperplanes in canonical form with respect to X_0 that satisfy ‖w‖ ≤ A, and let F be the class of corresponding linear threshold functions,

f(x; w, θ) = sgn(⟨x, w⟩ + θ).

Then the restriction of F to the points in X_0 has VC dimension bounded by

min{R²A², n} + 1.

Our argument will also be in terms of Theorem 3.9, and to that end we need to bound the fat-shattering dimension of the class of hyperplanes. We do this via an argument concerning the level fat-shattering dimension, defined below.

Definition 4.2 Let F be a set of real valued functions. We say that a set of points X is level γ-shattered by F at level r if it can be γ-shattered when choosing r_x = r for all x ∈ X. The level fat shattering dimension lfat_F of the set F is a function from the positive real numbers to the integers which maps a value γ to the size of the largest level γ-shattered set, if this is finite, or infinity otherwise.

The level fat-shattering dimension is a scale sensitive version of a dimension introduced by Vapnik [46]. The scale sensitive version was first introduced by Alon et al. [1].

Lemma 4.3 Let F be the set of linear functions with unit weight vectors, restricted to points in a ball of radius R,

F = {x ↦ ⟨w, x⟩ + θ : ‖w‖ = 1}.     (1)

Then the level fat shattering function can be bounded from above by

lfat_F(γ) ≤ min{R²/γ², n} + 1.

Proof: If a set of points X = {x_i}_i is to be level γ-shattered there must be a value r such that each dichotomy b can be realised with a weight vector w_b and threshold θ_b such that

⟨w_b, x_i⟩ + θ_b ≥ r + γ if b_i = 1, and ⟨w_b, x_i⟩ + θ_b ≤ r − γ otherwise.

Let γ' = min_{x∈X} |⟨w_b, x⟩ + θ_b − r|. Consider the hyperplane defined by (w'_b, θ'_b) = (w_b/γ', (θ_b − r)/γ'). It is in canonical form with respect to the points X, satisfies ‖w'_b‖ = ‖w_b/γ'‖ = 1/γ', and realises dichotomy b on X. Hence, the set of points X can be shattered by a subset of canonical hyperplanes (w'_b, θ'_b) satisfying ‖w'_b‖ ≤ 1/γ' ≤ 1/γ. The result follows from Theorem 4.1.

Corollary 4.4 Let F be the set, defined in (1), of linear functions with unit weight vectors, restricted to points in a ball of n dimensions of radius R about the origin and with thresholds |θ| ≤ R. The fat shattering function of F can be bounded by

fat_F(γ) ≤ min{9R²/γ², n + 1} + 1.

Proof: Suppose m points x_1, ..., x_m lying in a ball of radius R about the origin are γ-shattered relative to r = (r_1, ..., r_m). Since ‖w‖ = 1, |⟨w, x_i⟩ + θ| ≤ 2R, and so |r_i| ≤ 2R. From each x_i, i = 1, ..., m, we create an extended vector x̃_i := (x_i^1, ..., x_i^n, r_i/√2). Since |r_i| ≤ 2R, ‖x̃_i‖ ≤ √3 R. Let (w_b, θ_b) be the parameter vector of the hyperplane that realizes a dichotomy b ∈ {0,1}^m. Set w̃_b = (w_b^1, ..., w_b^n, −√2). We now show that the points x̃_i, i = 1, ..., m, are level γ-shattered at level 0 by {w̃_b}_{b∈{0,1}^m}. We have that

⟨w̃_b, x̃_i⟩ + θ_b = ⟨w_b, x_i⟩ + θ_b − r_i =: t.

But ⟨w_b, x_i⟩ + θ_b ≥ r_i + γ if b_i = 1, and ⟨w_b, x_i⟩ + θ_b ≤ r_i − γ if b_i = 0. Thus

t ≥ r_i + γ − r_i = γ if b_i = 1, and t ≤ r_i − γ − r_i = −γ if b_i = 0.

Now ‖w̃_b‖ = √3. Set w̃'_b = w̃_b/√3 and x̃'_i = √3 x̃_i. Then ‖w̃'_b‖ = 1 and the points x̃'_i, i = 1, ..., m, are level γ-shattered at level 0 by {w̃'_b}_{b∈{0,1}^m}. Since dim x̃'_i = n + 1 and ‖x̃'_i‖ ≤ √3 · √3 R = 3R, we have by Lemma 4.3 that fat_F(γ) ≤ min{9R²/γ², n + 1} + 1.

Theorem 4.5 Suppose inputs are drawn independently according to a distribution whose support is contained in a ball in R^n centred at the origin, of radius R. If we succeed in correctly classifying m such inputs by a canonical hyperplane with ‖w‖ = 1/γ and with |θ| ≤ R, then with confidence 1 − δ the generalization error will be bounded from above by

ε(m, δ) = (2/m)(k log(8em/k) log(32m) + log(8m/δ)),

where k = ⌊577 R²/γ²⌋.

Proof: Firstly note that we can restrict our consideration to the subclass of F with |θ| ≤ R. If there is more than one point to be γ-shattered, then it is required to achieve a dichotomy with different signs; that is, b is neither all 0s nor all 1s. Since all of the points lie in the ball, to shatter them the hyperplane must intersect the ball. Since ‖w‖ = 1, that means |θ| ≤ R. So although one may achieve a greater margin for the all-zero or all-one dichotomy by choosing a larger value of θ, all of the other dichotomies cannot achieve a larger γ. Thus although the bound may be weak in the special case of an all 0 or all 1 classification on the training set, it will still be true. Hence, we are now in a position to apply Theorem 3.9 with the value of θ given in that theorem taken as 0. Hence,

fat_F(γ/8) ≤ ⌊576 R²/γ² + 1⌋ < ⌊577 R²/γ²⌋,

since γ < R. Substituting into the bound of Theorem 3.9 gives the required bound.

In Section 6 we will give an analogous result as a special case of the more general framework derived in Section 5. Although the sample size bound for that result is weaker (by an additional log(m) factor), it does allow one to cope with the slightly more general situation of estimating the radius of the ball rather than knowing it in advance.

The fact that the bound in Theorem 4.5 does not depend on the dimension of the input space is particularly important in the light of Vapnik's ingenious construction of his support-vector machines [16, 48]. This is a method of implementing quite complex decision rules (such as those defined by polynomials or neural networks) in terms of linear hyperplanes in very many dimensions. The clever part of the technique is the algorithm which can work in a dual space, and which maximizes the margin on a training set. Thus Vapnik's algorithm, along with the bound of Theorem 4.5, should allow good a posteriori bounds on the generalization error in a range of applications.

It is important to note that our explanation of the good performance of maximum margin hyperplanes is different to that given by Vapnik in [48, page 135]. Whilst alluding to the result of Theorem 4.1, the theorem he presents as the explanation is a bound on the expected generalization error in terms of the number of support vectors. A small number of support vectors gives a good bound. One can construct examples in which all four combinations of small/large margin and few/many support vectors occur. Thus neither explanation is the only one. In the terminology of the next section, the margin and (the reciprocal of) the number of support vectors are both luckiness functions, and either could be used to determine bounds on performance.
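Theorem 4.5 turns an observed margin directly into a generalization bound. The sketch below computes the geometric margin 1/‖w‖ of a canonical hyperplane and evaluates the bound; the numerical inputs are arbitrary, and the bound is loose unless m is large relative to R²/γ².

```python
from math import e, log2, floor

def canonical_margin(w):
    """For a hyperplane in canonical form (min_i |<w, x_i> + theta| = 1), the geometric
    margin on the sample is 1 / ||w||."""
    return 1.0 / sum(wi * wi for wi in w) ** 0.5

def maximal_margin_bound(m, R, gamma, delta):
    """Theorem 4.5: bound for correct classification of m points in a ball of radius R
    by a canonical hyperplane with ||w|| = 1/gamma and |theta| <= R."""
    k = floor(577.0 * R * R / (gamma * gamma))
    return (2.0 / m) * (k * log2(8 * e * m / k) * log2(32 * m) + log2(8 * m / delta))

gamma = canonical_margin((0.5, 0.5))            # ||w|| = 1/sqrt(2), so the margin is sqrt(2)
print(maximal_margin_bound(m=10 ** 8, R=5.0, gamma=gamma, delta=0.01))
```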

5 Luckiness: A General Framework for Decomposing Classes

The standard PAC analysis gives bounds on generalization error that are uniform over the hypothesis class. Decomposing the hypothesis class, as described in Section 2, allows us to bias our generalization error bounds in favour of certain target functions and distributions: those for which some hypothesis low in the hierarchy is an accurate approximation. The results of Section 4 show that it is possible to decompose the hypothesis class on the basis of the observed data in some cases: there we did it in terms of the margin attained. In this section, we introduce a more general framework which subsumes the standard PAC model and the framework described in Section 2, and can recover (in a slightly weaker form) the results of Section 4 as a special case. This more general decomposition of the hypothesis class based on the sample allows us to bias our generalization error bounds in favour of more general classes of target functions and distributions, which might correspond to more realistic assumptions about practical learning problems.

It seems that in order to allow the decomposition of the hypothesis class to depend on the sample, we need to make better use of the information provided by the sample. Both the standard PAC analysis and structural risk minimisation with a fixed decomposition of the hypothesis class effectively discard the training examples, and only make use of the function Er_z defined on the hypothesis class that is induced by the training examples. The additional information we exploit in the case of sample-based decompositions of the hypothesis class is encapsulated in a luckiness function. The main idea is to fix in advance some assumption about the target function and distribution, and encode this assumption in a real-valued function defined on the space of training samples and hypotheses. The value of the function indicates the extent to which the assumption is satisfied for that sample and hypothesis. We call this mapping a luckiness function, since it reflects how fortunate we are that our assumption is satisfied. That is, we make use of a function

L : X^m × H → R⁺,

which measures the luckiness of a particular hypothesis with respect to the training examples. Sometimes it is convenient to express this relationship in an inverted way, as an unluckiness function,

U : X^m × H → R⁺.

It turns out that only the ordering that the luckiness or unluckiness functions impose on hypotheses is important. We define the level of a function h ∈ H relative to L and x by the function

ℓ(x, h) = |{b ∈ {0,1}^m : ∃g ∈ H, g(x) = b, L(x, g) ≥ L(x, h)}|

or

ℓ(x, h) = |{b ∈ {0,1}^m : ∃g ∈ H, g(x) = b, U(x, g) ≤ U(x, h)}|.

Whether ℓ(x, h) is defined in terms of L or U is a matter of convenience; the quantity ℓ(x, h) itself plays the central role in what follows. If x, y ∈ X^m, we denote by xy their concatenation (x_1, ..., x_m, y_1, ..., y_m).
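For a finite hypothesis class, the level ℓ(x, h) can be computed directly from its definition. The stump class and the unluckiness U(x, h) = |c| used below are purely illustrative assumptions; they only serve to show how the ordering induced by (un)luckiness determines ℓ.

```python
def level(x, h, hypotheses, unluckiness):
    """l(x, h): the number of dichotomies of the sample x realised by hypotheses g in H
    with U(x, g) <= U(x, h), i.e. by hypotheses at least as lucky as h."""
    u_h = unluckiness(x, h)
    return len({tuple(g["f"](xi) for xi in x)
                for g in hypotheses if unluckiness(x, g) <= u_h})

# Illustrative assumption: threshold functions t -> 1[t > c] on the line, with the toy
# unluckiness U(x, h) = |c| (small-magnitude thresholds count as lucky).
hypotheses = [{"c": c, "f": (lambda t, c=c: int(t > c))} for c in (-2.0, -1.0, 0.0, 1.0, 2.0)]
unluck = lambda x, g: abs(g["c"])
x = (-1.5, -0.5, 0.5, 1.5)
print([level(x, h, hypotheses, unluck) for h in hypotheses])   # [5, 3, 1, 3, 5]
```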

5.1 Examples

Example 5.1 Consider the hierarchy of classes introduced in Section 2 and define

U(x, h) = min{d : h ∈ H_d}.

Then it follows from Sauer's lemma that for any x we can bound ℓ(x, h) by

ℓ(x, h) ≤ (em/d)^d,

where d = U(x, h). Notice also that for any y ∈ X^m,

ℓ(xy, h) ≤ (2em/d)^d.

The last observation is something that will prove useful later when we investigate how we can use luckiness on a sample to infer luckiness on a subsequent sample.

We show in Section 6 that the hyperplane margin of Section 4 is a luckiness function which satisfies the technical restrictions we introduce below. We do this in fact in terms of the following unluckiness function, defined formally here for convenience later on.

Definition 5.2 If h is a linear threshold function with separating hyperplane defined by (w, θ), and (w, θ) is in canonical form with respect to an m-sample x, then define

U(x, h) = max_{1≤i≤m} ‖x_i‖² ‖w‖².

Finally, we give a separate unluckiness function for the maximal margin hyperplane example. In practical experiments it is frequently observed that the number of support vectors is significantly smaller than the full training sample. Vapnik [48, Theorem 5.2] gives a bound on the expected generalization error in terms of the number of support vectors, as well as giving examples of classifiers [48, Table 5.2] for which the number of support vectors was very much less than the number of training examples. We will call this unluckiness function the support vectors unluckiness function.

Definition 5.3 If h is a linear threshold function with separating hyperplane defined by (w, θ), and (w, θ) is the maximal margin hyperplane in canonical form with respect to an m-sample x, then define

U(x, h) = |{x ∈ x : |⟨x, w⟩ + θ| = 1}|,

that is, U is the number of support vectors of the hyperplane.
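Both unluckiness functions just defined are straightforward to evaluate once a canonical hyperplane (w, θ) is at hand. The sample and hyperplane below are hypothetical, chosen only so that the canonical-form condition min_i |⟨x_i, w⟩ + θ| = 1 holds.

```python
def canonical_unluckiness(x, w):
    """Definition 5.2: U(x, h) = max_i ||x_i||^2 * ||w||^2.  Since the margin of a canonical
    hyperplane is 1/||w||, this equals R^2 / gamma^2 measured on the sample."""
    sq = lambda v: sum(vi * vi for vi in v)
    return max(sq(xi) for xi in x) * sq(w)

def support_vector_unluckiness(x, w, theta, tol=1e-9):
    """Definition 5.3: the number of support vectors, i.e. sample points with
    |<x, w> + theta| = 1 (a tolerance is used for floating point comparison)."""
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    return sum(abs(abs(dot(xi, w) + theta) - 1.0) <= tol for xi in x)

x = [(1.0, 1.0), (2.0, 0.0), (-1.0, -1.0), (-2.0, 0.0)]      # hypothetical sample
w, theta = (0.5, 0.5), 0.0                                   # canonical for this sample
print(canonical_unluckiness(x, w), support_vector_unluckiness(x, w, theta))
```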

5.2 Probable Smoothness of Luckiness Functions

We now introduce a technical restriction on luckiness functions required for our theorem.

Definition 5.4 An η-subsequence of a vector x is a vector x' obtained from x by deleting a fraction of at most η of the coordinates. We will also write x' ⊆_η x. For a partitioned vector xy, we write x'y' ⊆_η xy. A luckiness function L(x, h) defined on a function class H is probably smooth with respect to functions φ(m, L, δ) and η(m, L, δ) if, for all targets t in H and for every distribution P,

P^{2m}{xy : ∃h ∈ H, Er_x(h) = 0, ∀x'y' ⊆_η xy, ℓ(x'y', h) > φ(m, L(x, h), δ)} ≤ δ,

where η = η(m, L(x, h), δ). The definition for probably smooth unluckiness is identical except that L's are replaced by U's.

The intuition behind this rather arcane definition is that it captures when the luckiness can be estimated from the first half of the sample with high confidence. In particular, we need to ensure that few dichotomies are luckier than h on the double sample. That is, for a probably smooth luckiness function, if an hypothesis h has luckiness L on the first m points, we know that, with high confidence, for most (at least a proportion 1 − η(m, L, δ)) of the points in a double sample, the growth function for the class of functions that are at least as lucky as h is small (no more than φ(m, L, δ)).

Theorem 5.5 Suppose p_i, i = 1, ..., 2m, are positive numbers satisfying Σ_{i=1}^{2m} p_i = 1, L is a luckiness function for a function class H that is probably smooth with respect to functions φ and η, m ∈ N and 0 < δ < 1/2. For any target function t ∈ H and any distribution P, with probability 1 − δ over m independent examples x chosen according to P, if for any i ∈ N a learner finds an hypothesis h in H with Er_x(h) = 0 and φ(m, L(x, h), δ p_i/4) ≤ 2^{i+1}, then the generalization error of h satisfies er_P(h) ≤ ε(m, i, δ), where

ε(m, i, δ) = (2/m)(i + 1 + log(4/(δ p_i))) + 4 η(m, L(x, h), δ p_i/4) log(4m).

Proof: By Lemma 3.8,

P^m{x : ∃h ∈ H, ∃i ∈ N, Er_x(h) = 0, φ(m, L(x, h), δ p_i/4) ≤ 2^{i+1}, er_P(h) > ε(m, i, δ)}
≤ 2 P^{2m}{xy : ∃h ∈ H, ∃i ∈ N, Er_x(h) = 0, φ(m, L(x, h), δ p_i/4) ≤ 2^{i+1}, Er_y(h) > m ε(m, i, δ)/2},

provided m ≥ 2/ε(m, i, δ), which follows from the definition of ε(m, i, δ) and the fact that δ < 1/2. Hence it suffices to show that P^{2m}(J_i) ≤ δ_i = p_i δ/2 for each i ∈ N, where J_i is the event

{xy : ∃h ∈ H, Er_x(h) = 0, φ(m, L(x, h), δ p_i/4) ≤ 2^{i+1}, Er_y(h) ≥ m ε(m, i, δ)/2}.

Let S be the event

{xy : ∃h ∈ H, Er_x(h) = 0, ∀x'y' ⊆_η xy, ℓ(x'y', h) > φ(m, L(x, h), δ_i/2)},

with η = η(m, L, δ_i/2). It follows that

P^{2m}(J_i) = P^{2m}(J_i ∩ S) + P^{2m}(J_i ∩ S̄) ≤ δ_i/2 + P^{2m}(J_i ∩ S̄).

It suffices then to show that P^{2m}(J_i ∩ S̄) ≤ δ_i/2. But J_i ∩ S̄ is a subset of

R = {xy : ∃h ∈ H, Er_x(h) = 0, ∃x'y' ⊆_η xy, ℓ(x'y', h) ≤ 2^{i+1}, Er_{y'}(h) ≥ m ε(m, i, δ)/2 − (|y| − |y'|)},

where |y'| denotes the length of the sequence y'. Now, if we consider the uniform distribution U on the group of permutations σ on {1, ..., 2m} that swap elements i and i + m, we have

P^{2m}(R) ≤ sup_{xy} U{σ : σ(xy) ∈ R},

where σ(z) = (z_{σ(1)}, ..., z_{σ(2m)}) for z ∈ X^{2m}. Fix xy ∈ X^{2m}. For a subsequence x'y' ⊆_η xy, we let σ(x'y') denote the corresponding subsequence of the permuted version of xy (and similarly for σ(x') and σ(y')). Then

U{σ : σ(xy) ∈ R}
≤ U{σ : ∃x'y' ⊆_η xy, ∃h ∈ H, ℓ(σ(x'y'), h) ≤ 2^{i+1}, Er_{σ(x')}(h) = 0, Er_{σ(y')}(h) ≥ m ε(m, i, δ)/2 − (|y| − |y'|)}
≤ Σ_{x'y' ⊆_η xy} U{σ : ∃h ∈ H, ℓ(σ(x'y'), h) ≤ 2^{i+1}, Er_{σ(x')}(h) = 0, Er_{σ(y')}(h) ≥ m ε(m, i, δ)/2 − (|y| − |y'|)}.

For a fixed subsequence x'y' ⊆_η xy, define the event inside the last sum as A. We can partition the group of permutations into a number of equivalence classes, so that, for all i, within each class all permutations map i to a fixed value unless x'y' contains both x_i and y_i. Clearly, all equivalence classes have equal probability, so we have

U(A) = Σ_C Pr(A|C) Pr(C) ≤ sup_C Pr(A|C),

where the sum and supremum are over equivalence classes C. But within an equivalence class, σ(x'y') is a permutation of x'y', so we can write

Pr(A|C) = Pr(∃h ∈ H, ℓ(σ(x'y'), h) ≤ 2^{i+1}, Er_{σ(x')}(h) = 0, Er_{σ(y')}(h) ≥ m ε(m, i, δ)/2 − (|y| − |y'|) | C)
≤ sup_{σ∈C} |H_{|σ(x'y')}| sup_h Pr(Er_{σ(x')}(h) = 0, Er_{σ(y')}(h) ≥ m ε(m, i, δ)/2 − (|y| − |y'|) | C),     (2)

where the second supremum is over the subset of H for which ℓ(σ(x'y'), h) ≤ 2^{i+1}. Clearly, |H_{|σ(x'y')}| ≤ 2^{i+1}, and the probability in (2) is no more than 2^{−m ε(m, i, δ)/2 + 2ηm}.

Combining these results, we have

P^{2m}(J_i ∩ S̄) ≤ (2m choose 2ηm) 2^{i+1} 2^{−m ε(m, i, δ)/2 + 2ηm},

and this is no more than δ_i/2 = p_i δ/4 if

ε(m, i, δ) ≥ (2/m)(2ηm log(2m) + i + 1 + 2ηm + log(4/(p_i δ))).

The theorem follows.

6 Examples of Probably Smooth Luckiness Functions

In this section, we consider four examples of luckiness functions and show that they are probably smooth. The first example (Example 5.1) is the simplest; in this case luckiness depends only on the hypothesis h and is independent of the examples x. In the second example, luckiness depends only on the examples, and is independent of the hypothesis. The third example allows us to predict the generalization performance of the maximal margin classifier. In this case, luckiness clearly depends on both the examples and the hypothesis. (This is the only example we present here where the luckiness function is both a function of the data and the hypothesis.) The fourth example concerns the VC dimension of a class of functions when restricted to the particular sample available.

First Example If we consider Example 5.1, the unluckiness function is clearly probably smooth if we choose φ(m, U(x, h), δ) = (2em/U)^U and η(m, U, δ) = 0 for all m and δ. The bound on generalization error that we obtain from Theorem 5.5 is almost identical to that given in Theorem 2.1.
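Instantiating Theorem 5.5 with this first example reproduces essentially the arithmetic of Theorem 2.1, and it is easy to evaluate. The sketch below computes the bound of Theorem 5.5 as stated above; the prior p_i = 2^(-i) and the other numerical choices are illustrative assumptions (in particular, renormalising the p_i over i = 1, ..., 2m is ignored here).

```python
from math import ceil, e, log2

def luckiness_bound(m, i, delta, p_i, eta):
    """Theorem 5.5 (as stated above): bound once phi(m, L(x,h), delta*p_i/4) <= 2**(i+1)
    has been verified; eta is the value eta(m, L(x,h), delta*p_i/4)."""
    return (2.0 / m) * (i + 1 + log2(4.0 / (delta * p_i))) + 4.0 * eta * log2(4 * m)

# First example: phi(m, d, delta) = (2em/d)^d and eta = 0, so taking
# i = ceil(d * log2(2em/d)) guarantees phi <= 2**(i+1).
m, d, delta = 10000, 10, 0.01
i = ceil(d * log2(2 * e * m / d))
print(luckiness_bound(m, i, delta, p_i=2.0 ** (-i), eta=0.0))
```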

Second Example The second example we consider involves examples lying on hyperplanes.

Definition 6.1 Define the unluckiness function U(x, h) for a linear threshold function h as

U(x, h) = dim span{x},

the dimension of the vector space spanned by the vectors x.

Proposition 6.2 Let H be the class of linear threshold functions defined on R^d. The unluckiness function of Definition 6.1 is probably smooth with respect to φ(m, U, δ) = (2em/U)^U and

η(m, U, δ) = (4/m)(U ln(2em/U) + ln(4d/δ)).

Proof: The recognition of a k-dimensional subspace is a learning problem for the indicator functions H_k of the subspaces. These have VC dimension k. Hence, applying the hierarchical approach of Theorem 2.1 taking p_k = 1/d, we obtain the given error bound for the number of examples in the second half of the sequence lying outside the subspace. Hence, with probability 1 − δ there will be a (1 − η)-subsequence of points all lying in the given subspace. For this sequence the growth function is bounded by φ(m, U, δ).

The above example will be useful if we have a distribution which is highly concentrated on the subspace with only a small probability of points lying outside it. We conjecture that it is possible to relax the assumption that the probability distribution is concentrated exactly on the subspace, to take advantage of a situation where it is concentrated around the subspace and the classifications are compatible with a perpendicular projection onto the space. This will also make use of both the data and the classification to decide the luckiness.

Third Example We are now in a position to state the result concerning maximal margin hyperplanes.

Proposition 6.3 The unluckiness function of Definition 5.2 is probably smooth with φ(m, U, δ) = (2em/(9U))^{9U} and

η(m, U, δ) = (1/m)(k log(8em/k) log(32m) + log(4m + 2) + 2 log(2/δ)),

where k = ⌊1297U⌋.

Proof: By the definition of the unluckiness function U, we have that the maximal margin hyperplane has margin γ satisfying U = R²/γ², where

R = max_{1≤i≤m} ‖x_i‖.

The proof works by allowing two sets of points to be excluded from the second half of the sample, hence making up the value of η. By ignoring these points, with probability 1 − δ the remaining points will be in the ball of radius R about the origin and will be correctly classified by the maximal margin hyperplane with a margin of γ/3. Provided this is the case, the function φ(m, U, δ) gives a bound on the growth function on the double sample of hyperplanes with larger margins. Hence, it remains to show that with probability 1 − δ there exists a fraction of η(m, U, δ) points of the double sample whose removal leaves a subsequence of points satisfying the above conditions.

First consider the class H = {f_a | a ∈ R⁺}, where

f_a(x) = 1 if ‖x‖ ≤ a, and 0 otherwise.

The class has VC dimension 1 and so, by the permutation argument, with probability 1 − δ/2 at most a fraction η_1 of the second half of the sample are outside the ball B centred at the origin containing x, with radius R, where

η_1 = (1/m)(log(2m + 1) + log(2/δ)),

since the growth function B_H(m) = m + 1. We now consider the permutation argument applied to the points of the double sample contained in B to estimate how many are closer to the hyperplane than γ/3 or are incorrectly classified. This involves an application of Lemma 3.6 with γ/3 substituted for γ, and using the folding argument introduced just before that Lemma. We have by Corollary 4.4 that

fat_F(γ) ≤ min{9R²/γ², n + 1} + 1,

where F is the set of linear functions with unit weight vector restricted to points in a ball of radius R about the origin. Hence, with probability 1 − δ/2 at most a fraction η_2 of the second half of the sample that are in B are either not correctly classified or within a margin of γ/3 of the hyperplane, where

η_2 = (1/m)(k log(8em/k) log(32m) + log(4/δ)),

for k = ⌊1297 R²/γ²⌋ = ⌊1297U⌋. The result follows by adding the numbers of excluded points η_1 m and η_2 m and expressing the result as a fraction of the double sample, as required.

Combining the results of Theorem 5.5 and Proposition 6.3 gives the following corollary.

Corollary 6.4 Suppose p_i, for i = 1, ..., 2m, are positive numbers satisfying Σ_{i=1}^{2m} p_i = 1. Suppose 0 < δ < 1/2, t ∈ H, and P is a probability distribution on X. Then with probability 1 − δ over m independent examples x chosen according to P, if a learner finds an hypothesis h that satisfies Er_x(h) = 0, then the generalization error of h is no more than

ε(m, U, δ) = (2/m)(9U log(2em/(9U)) + 1 + log(4/(δ p_i))) + (4/m)(⌊1297U⌋ log(8em/⌊1297U⌋) log(32m) + log(16m + 8) + 2 log(4/(δ p_i))) log(4m),

where U = U(x, h) for the unluckiness function of Definition 5.2 and i is the smallest integer such that (2em/(9U))^{9U} ≤ 2^{i+1}.

If we compare this corollary with Theorem 4.5, there is an extra log(m) factor that arises from the fact that we have to consider all possible permutations of the omitted subsequence in the general proof, whereas that is not necessary in the direct argument based on fat-shattering. The additional generality obtained here is that the support of the probability distribution does not need to be known, and even if it is, we may derive advantage from observing points with small norms, hence giving a better value of U = R²/γ² than would be obtained in Theorem 4.5 where the a priori bound on R must be used.

Vapnik has used precisely the expression for this unluckiness function (given in Definition 5.2) as an estimate of the effective VC dimension of the Support Vector Machine [48, p. 139]. The functional obtained is used to locate the best suited complexity class among different polynomial kernel functions in the Support Vector Machine [48, Table 5.6]. The result above shows that this strategy is well-founded, by giving a bound on the generalization error in terms of this quantity.

It is interesting to note that the support vectors unluckiness function of Definition 5.3 relates to the same classifiers, but is one which for a given sample defines a different ordering on the functions in the class, in the sense that a large margin can occur with a large number of support vectors, while a small margin can be forced by a small number of support vectors. We will omit the proof of the probable smoothness of the support vectors unluckiness function, since a more direct bound on the generalization error can be obtained using the results of Floyd and Warmuth [19]. Since the set of support vectors is a compression scheme, Theorem 5.1 of [19] can be rephrased as follows. Let MMH be the function that returns the maximal margin hyperplane consistent with a labelled sample. Note that applying the function MMH to the labelled support vectors returns the maximal margin hyperplane of which they are the support vectors.

Theorem 6.5 (Littlestone and Warmuth [33]) Let D be any probability distribution on a domain X, and c be any concept on X. Then the probability that m examples drawn independently at random according to D contain a subset of at most d examples that map via MMH to a hypothesis that is both consistent with all m examples and has error larger than ε is at most

Σ_{i=0}^{d} (m choose i) (1 − ε)^{m−i}.

The theorem implies that the generalization error of a maximal margin hyperplane with d support vectors among a sample of size m can with confidence 1 − δ be bounded by

(1/(m − d))(d log(em/d) + log(m/δ)),

where we have allowed different numbers of support vectors by applying standard SRM to the bounds for different d. Note that the value d, that is the unluckiness of Definition 5.3, plays the role of the VC dimension in the bound. Using a similar technique to that of the above theorem, it is possible to show that the support vectors unluckiness function is indeed probably smooth with respect to

φ(m, d, δ) = (2em/d)^d and η(m, d, δ) = (1/(2m))(d log(2em/d) + log(1/δ)).

However, the resulting bound on the generalization error involves an extra log factor.
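The compression-scheme bound quoted above is the simplest of the bounds in this section to evaluate; the short sketch below is a calculator for it with arbitrary numerical inputs (logs to base 2, per the paper's convention).

```python
from math import e, log2

def compression_bound(m, d, delta):
    """The bound derived from Theorem 6.5: generalization error of a maximal margin
    hyperplane with d support vectors among m examples, with confidence 1 - delta."""
    return (1.0 / (m - d)) * (d * log2(e * m / d) + log2(m / delta))

print(compression_bound(m=10000, d=50, delta=0.01))
```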


More information

Quantum Mechanics in Three Dimensions

Quantum Mechanics in Three Dimensions Physics 342 Lecture 20 Quantum Mechanics in Three Dimensions Lecture 20 Physics 342 Quantum Mechanics I Monay, March 24th, 2008 We begin our spherical solutions with the simplest possible case zero potential.

More information

arxiv: v4 [cs.ds] 7 Mar 2014

arxiv: v4 [cs.ds] 7 Mar 2014 Analysis of Agglomerative Clustering Marcel R. Ackermann Johannes Blömer Daniel Kuntze Christian Sohler arxiv:101.697v [cs.ds] 7 Mar 01 Abstract The iameter k-clustering problem is the problem of partitioning

More information

Math 115 Section 018 Course Note

Math 115 Section 018 Course Note Course Note 1 General Functions Definition 1.1. A function is a rule that takes certain numbers as inputs an assigns to each a efinite output number. The set of all input numbers is calle the omain of

More information

A simple model for the small-strain behaviour of soils

A simple model for the small-strain behaviour of soils A simple moel for the small-strain behaviour of soils José Jorge Naer Department of Structural an Geotechnical ngineering, Polytechnic School, University of São Paulo 05508-900, São Paulo, Brazil, e-mail:

More information

PDE Notes, Lecture #11

PDE Notes, Lecture #11 PDE Notes, Lecture # from Professor Jalal Shatah s Lectures Febuary 9th, 2009 Sobolev Spaces Recall that for u L loc we can efine the weak erivative Du by Du, φ := udφ φ C0 If v L loc such that Du, φ =

More information

Influence of weight initialization on multilayer perceptron performance

Influence of weight initialization on multilayer perceptron performance Influence of weight initialization on multilayer perceptron performance M. Karouia (1,2) T. Denœux (1) R. Lengellé (1) (1) Université e Compiègne U.R.A. CNRS 817 Heuiasyc BP 649 - F-66 Compiègne ceex -

More information

Math 342 Partial Differential Equations «Viktor Grigoryan

Math 342 Partial Differential Equations «Viktor Grigoryan Math 342 Partial Differential Equations «Viktor Grigoryan 6 Wave equation: solution In this lecture we will solve the wave equation on the entire real line x R. This correspons to a string of infinite

More information

Calculus and optimization

Calculus and optimization Calculus an optimization These notes essentially correspon to mathematical appenix 2 in the text. 1 Functions of a single variable Now that we have e ne functions we turn our attention to calculus. A function

More information

Solutions to Math 41 Second Exam November 4, 2010

Solutions to Math 41 Second Exam November 4, 2010 Solutions to Math 41 Secon Exam November 4, 2010 1. (13 points) Differentiate, using the metho of your choice. (a) p(t) = ln(sec t + tan t) + log 2 (2 + t) (4 points) Using the rule for the erivative of

More information

Math 1B, lecture 8: Integration by parts

Math 1B, lecture 8: Integration by parts Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores

More information

Differentiation ( , 9.5)

Differentiation ( , 9.5) Chapter 2 Differentiation (8.1 8.3, 9.5) 2.1 Rate of Change (8.2.1 5) Recall that the equation of a straight line can be written as y = mx + c, where m is the slope or graient of the line, an c is the

More information

Lecture 5. Symmetric Shearer s Lemma

Lecture 5. Symmetric Shearer s Lemma Stanfor University Spring 208 Math 233: Non-constructive methos in combinatorics Instructor: Jan Vonrák Lecture ate: January 23, 208 Original scribe: Erik Bates Lecture 5 Symmetric Shearer s Lemma Here

More information

Topic 7: Convergence of Random Variables

Topic 7: Convergence of Random Variables Topic 7: Convergence of Ranom Variables Course 003, 2016 Page 0 The Inference Problem So far, our starting point has been a given probability space (S, F, P). We now look at how to generate information

More information

JUST THE MATHS UNIT NUMBER DIFFERENTIATION 2 (Rates of change) A.J.Hobson

JUST THE MATHS UNIT NUMBER DIFFERENTIATION 2 (Rates of change) A.J.Hobson JUST THE MATHS UNIT NUMBER 10.2 DIFFERENTIATION 2 (Rates of change) by A.J.Hobson 10.2.1 Introuction 10.2.2 Average rates of change 10.2.3 Instantaneous rates of change 10.2.4 Derivatives 10.2.5 Exercises

More information

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics This moule is part of the Memobust Hanbook on Methoology of Moern Business Statistics 26 March 2014 Metho: Balance Sampling for Multi-Way Stratification Contents General section... 3 1. Summary... 3 2.

More information

CHAPTER 1 : DIFFERENTIABLE MANIFOLDS. 1.1 The definition of a differentiable manifold

CHAPTER 1 : DIFFERENTIABLE MANIFOLDS. 1.1 The definition of a differentiable manifold CHAPTER 1 : DIFFERENTIABLE MANIFOLDS 1.1 The efinition of a ifferentiable manifol Let M be a topological space. This means that we have a family Ω of open sets efine on M. These satisfy (1), M Ω (2) the

More information

Stopping-Set Enumerator Approximations for Finite-Length Protograph LDPC Codes

Stopping-Set Enumerator Approximations for Finite-Length Protograph LDPC Codes Stopping-Set Enumerator Approximations for Finite-Length Protograph LDPC Coes Kaiann Fu an Achilleas Anastasopoulos Electrical Engineering an Computer Science Dept. University of Michigan, Ann Arbor, MI

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 12 EFFICIENT LEARNING So far, our focus has been on moels of learning an basic algorithms for those moels. We have not place much emphasis on how to learn quickly.

More information

Polynomial Inclusion Functions

Polynomial Inclusion Functions Polynomial Inclusion Functions E. e Weert, E. van Kampen, Q. P. Chu, an J. A. Muler Delft University of Technology, Faculty of Aerospace Engineering, Control an Simulation Division E.eWeert@TUDelft.nl

More information

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes Leaving Ranomness to Nature: -Dimensional Prouct Coes through the lens of Generalize-LDPC coes Tavor Baharav, Kannan Ramchanran Dept. of Electrical Engineering an Computer Sciences, U.C. Berkeley {tavorb,

More information

The Exact Form and General Integrating Factors

The Exact Form and General Integrating Factors 7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily

More information

Agmon Kolmogorov Inequalities on l 2 (Z d )

Agmon Kolmogorov Inequalities on l 2 (Z d ) Journal of Mathematics Research; Vol. 6, No. ; 04 ISSN 96-9795 E-ISSN 96-9809 Publishe by Canaian Center of Science an Eucation Agmon Kolmogorov Inequalities on l (Z ) Arman Sahovic Mathematics Department,

More information

Table of Common Derivatives By David Abraham

Table of Common Derivatives By David Abraham Prouct an Quotient Rules: Table of Common Derivatives By Davi Abraham [ f ( g( ] = [ f ( ] g( + f ( [ g( ] f ( = g( [ f ( ] g( g( f ( [ g( ] Trigonometric Functions: sin( = cos( cos( = sin( tan( = sec

More information

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x)

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x) Y. D. Chong (2016) MH2801: Complex Methos for the Sciences 1. Derivatives The erivative of a function f(x) is another function, efine in terms of a limiting expression: f (x) f (x) lim x δx 0 f(x + δx)

More information

A Sketch of Menshikov s Theorem

A Sketch of Menshikov s Theorem A Sketch of Menshikov s Theorem Thomas Bao March 14, 2010 Abstract Let Λ be an infinite, locally finite oriente multi-graph with C Λ finite an strongly connecte, an let p

More information

QF101: Quantitative Finance September 5, Week 3: Derivatives. Facilitator: Christopher Ting AY 2017/2018. f ( x + ) f(x) f(x) = lim

QF101: Quantitative Finance September 5, Week 3: Derivatives. Facilitator: Christopher Ting AY 2017/2018. f ( x + ) f(x) f(x) = lim QF101: Quantitative Finance September 5, 2017 Week 3: Derivatives Facilitator: Christopher Ting AY 2017/2018 I recoil with ismay an horror at this lamentable plague of functions which o not have erivatives.

More information

Chapter 6: Energy-Momentum Tensors

Chapter 6: Energy-Momentum Tensors 49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.

More information

Lecture 6: Calculus. In Song Kim. September 7, 2011

Lecture 6: Calculus. In Song Kim. September 7, 2011 Lecture 6: Calculus In Song Kim September 7, 20 Introuction to Differential Calculus In our previous lecture we came up with several ways to analyze functions. We saw previously that the slope of a linear

More information

Qubit channels that achieve capacity with two states

Qubit channels that achieve capacity with two states Qubit channels that achieve capacity with two states Dominic W. Berry Department of Physics, The University of Queenslan, Brisbane, Queenslan 4072, Australia Receive 22 December 2004; publishe 22 March

More information

Discrete Mathematics

Discrete Mathematics Discrete Mathematics 309 (009) 86 869 Contents lists available at ScienceDirect Discrete Mathematics journal homepage: wwwelseviercom/locate/isc Profile vectors in the lattice of subspaces Dániel Gerbner

More information

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION The Annals of Statistics 1997, Vol. 25, No. 6, 2313 2327 LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION By Eva Riccomagno, 1 Rainer Schwabe 2 an Henry P. Wynn 1 University of Warwick, Technische

More information

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties Journal of Machine Learning Research 16 (2015) 1547-1572 Submitte 1/14; Revise 9/14; Publishe 8/15 Flexible High-Dimensional Classification Machines an Their Asymptotic Properties Xingye Qiao Department

More information

Lower bounds on Locality Sensitive Hashing

Lower bounds on Locality Sensitive Hashing Lower bouns on Locality Sensitive Hashing Rajeev Motwani Assaf Naor Rina Panigrahy Abstract Given a metric space (X, X ), c 1, r > 0, an p, q [0, 1], a istribution over mappings H : X N is calle a (r,

More information

On colour-blind distinguishing colour pallets in regular graphs

On colour-blind distinguishing colour pallets in regular graphs J Comb Optim (2014 28:348 357 DOI 10.1007/s10878-012-9556-x On colour-blin istinguishing colour pallets in regular graphs Jakub Przybyło Publishe online: 25 October 2012 The Author(s 2012. This article

More information

SYNCHRONOUS SEQUENTIAL CIRCUITS

SYNCHRONOUS SEQUENTIAL CIRCUITS CHAPTER SYNCHRONOUS SEUENTIAL CIRCUITS Registers an counters, two very common synchronous sequential circuits, are introuce in this chapter. Register is a igital circuit for storing information. Contents

More information

Make graph of g by adding c to the y-values. on the graph of f by c. multiplying the y-values. even-degree polynomial. graph goes up on both sides

Make graph of g by adding c to the y-values. on the graph of f by c. multiplying the y-values. even-degree polynomial. graph goes up on both sides Reference 1: Transformations of Graphs an En Behavior of Polynomial Graphs Transformations of graphs aitive constant constant on the outsie g(x) = + c Make graph of g by aing c to the y-values on the graph

More information

REAL ANALYSIS I HOMEWORK 5

REAL ANALYSIS I HOMEWORK 5 REAL ANALYSIS I HOMEWORK 5 CİHAN BAHRAN The questions are from Stein an Shakarchi s text, Chapter 3. 1. Suppose ϕ is an integrable function on R with R ϕ(x)x = 1. Let K δ(x) = δ ϕ(x/δ), δ > 0. (a) Prove

More information

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences.

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences. S 63 Lecture 8 2/2/26 Lecturer Lillian Lee Scribes Peter Babinski, Davi Lin Basic Language Moeling Approach I. Special ase of LM-base Approach a. Recap of Formulas an Terms b. Fixing θ? c. About that Multinomial

More information

Introduction to variational calculus: Lecture notes 1

Introduction to variational calculus: Lecture notes 1 October 10, 2006 Introuction to variational calculus: Lecture notes 1 Ewin Langmann Mathematical Physics, KTH Physics, AlbaNova, SE-106 91 Stockholm, Sween Abstract I give an informal summary of variational

More information

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation JOURNAL OF MATERIALS SCIENCE 34 (999)5497 5503 Thermal conuctivity of grae composites: Numerical simulations an an effective meium approximation P. M. HUI Department of Physics, The Chinese University

More information

Linear and quadratic approximation

Linear and quadratic approximation Linear an quaratic approximation November 11, 2013 Definition: Suppose f is a function that is ifferentiable on an interval I containing the point a. The linear approximation to f at a is the linear function

More information

Introduction to the Vlasov-Poisson system

Introduction to the Vlasov-Poisson system Introuction to the Vlasov-Poisson system Simone Calogero 1 The Vlasov equation Consier a particle with mass m > 0. Let x(t) R 3 enote the position of the particle at time t R an v(t) = ẋ(t) = x(t)/t its

More information

Equilibrium in Queues Under Unknown Service Times and Service Value

Equilibrium in Queues Under Unknown Service Times and Service Value University of Pennsylvania ScholarlyCommons Finance Papers Wharton Faculty Research 1-2014 Equilibrium in Queues Uner Unknown Service Times an Service Value Laurens Debo Senthil K. Veeraraghavan University

More information

Generalized Tractability for Multivariate Problems

Generalized Tractability for Multivariate Problems Generalize Tractability for Multivariate Problems Part II: Linear Tensor Prouct Problems, Linear Information, an Unrestricte Tractability Michael Gnewuch Department of Computer Science, University of Kiel,

More information

Assignment 1. g i (x 1,..., x n ) dx i = 0. i=1

Assignment 1. g i (x 1,..., x n ) dx i = 0. i=1 Assignment 1 Golstein 1.4 The equations of motion for the rolling isk are special cases of general linear ifferential equations of constraint of the form g i (x 1,..., x n x i = 0. i=1 A constraint conition

More information

Necessary and Sufficient Conditions for Sketched Subspace Clustering

Necessary and Sufficient Conditions for Sketched Subspace Clustering Necessary an Sufficient Conitions for Sketche Subspace Clustering Daniel Pimentel-Alarcón, Laura Balzano 2, Robert Nowak University of Wisconsin-Maison, 2 University of Michigan-Ann Arbor Abstract This

More information

MODELLING DEPENDENCE IN INSURANCE CLAIMS PROCESSES WITH LÉVY COPULAS ABSTRACT KEYWORDS

MODELLING DEPENDENCE IN INSURANCE CLAIMS PROCESSES WITH LÉVY COPULAS ABSTRACT KEYWORDS MODELLING DEPENDENCE IN INSURANCE CLAIMS PROCESSES WITH LÉVY COPULAS BY BENJAMIN AVANZI, LUKE C. CASSAR AND BERNARD WONG ABSTRACT In this paper we investigate the potential of Lévy copulas as a tool for

More information

Iterated Point-Line Configurations Grow Doubly-Exponentially

Iterated Point-Line Configurations Grow Doubly-Exponentially Iterate Point-Line Configurations Grow Doubly-Exponentially Joshua Cooper an Mark Walters July 9, 008 Abstract Begin with a set of four points in the real plane in general position. A to this collection

More information

TMA 4195 Matematisk modellering Exam Tuesday December 16, :00 13:00 Problems and solution with additional comments

TMA 4195 Matematisk modellering Exam Tuesday December 16, :00 13:00 Problems and solution with additional comments Problem F U L W D g m 3 2 s 2 0 0 0 0 2 kg 0 0 0 0 0 0 Table : Dimension matrix TMA 495 Matematisk moellering Exam Tuesay December 6, 2008 09:00 3:00 Problems an solution with aitional comments The necessary

More information

A Weak First Digit Law for a Class of Sequences

A Weak First Digit Law for a Class of Sequences International Mathematical Forum, Vol. 11, 2016, no. 15, 67-702 HIKARI Lt, www.m-hikari.com http://x.oi.org/10.1288/imf.2016.6562 A Weak First Digit Law for a Class of Sequences M. A. Nyblom School of

More information

Lecture 2 Lagrangian formulation of classical mechanics Mechanics

Lecture 2 Lagrangian formulation of classical mechanics Mechanics Lecture Lagrangian formulation of classical mechanics 70.00 Mechanics Principle of stationary action MATH-GA To specify a motion uniquely in classical mechanics, it suffices to give, at some time t 0,

More information

Parameter estimation: A new approach to weighting a priori information

Parameter estimation: A new approach to weighting a priori information Parameter estimation: A new approach to weighting a priori information J.L. Mea Department of Mathematics, Boise State University, Boise, ID 83725-555 E-mail: jmea@boisestate.eu Abstract. We propose a

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Technical Report TTI-TR-2008-5 Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri UC San Diego Sham M. Kakae Toyota Technological Institute at Chicago ABSTRACT Clustering ata in

More information

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent

More information

Monte Carlo Methods with Reduced Error

Monte Carlo Methods with Reduced Error Monte Carlo Methos with Reuce Error As has been shown, the probable error in Monte Carlo algorithms when no information about the smoothness of the function is use is Dξ r N = c N. It is important for

More information

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS Yannick DEVILLE Université Paul Sabatier Laboratoire Acoustique, Métrologie, Instrumentation Bât. 3RB2, 8 Route e Narbonne,

More information

Math 1271 Solutions for Fall 2005 Final Exam

Math 1271 Solutions for Fall 2005 Final Exam Math 7 Solutions for Fall 5 Final Eam ) Since the equation + y = e y cannot be rearrange algebraically in orer to write y as an eplicit function of, we must instea ifferentiate this relation implicitly

More information

1 dx. where is a large constant, i.e., 1, (7.6) and Px is of the order of unity. Indeed, if px is given by (7.5), the inequality (7.

1 dx. where is a large constant, i.e., 1, (7.6) and Px is of the order of unity. Indeed, if px is given by (7.5), the inequality (7. Lectures Nine an Ten The WKB Approximation The WKB metho is a powerful tool to obtain solutions for many physical problems It is generally applicable to problems of wave propagation in which the frequency

More information

Logarithmic spurious regressions

Logarithmic spurious regressions Logarithmic spurious regressions Robert M. e Jong Michigan State University February 5, 22 Abstract Spurious regressions, i.e. regressions in which an integrate process is regresse on another integrate

More information

Algorithms and matching lower bounds for approximately-convex optimization

Algorithms and matching lower bounds for approximately-convex optimization Algorithms an matching lower bouns for approximately-convex optimization Yuanzhi Li Department of Computer Science Princeton University Princeton, NJ, 08450 yuanzhil@cs.princeton.eu Anrej Risteski Department

More information

Monotonicity of facet numbers of random convex hulls

Monotonicity of facet numbers of random convex hulls Monotonicity of facet numbers of ranom convex hulls Gilles Bonnet, Julian Grote, Daniel Temesvari, Christoph Thäle, Nicola Turchi an Florian Wespi arxiv:173.31v1 [math.mg] 7 Mar 17 Abstract Let X 1,...,

More information

Vectors in two dimensions

Vectors in two dimensions Vectors in two imensions Until now, we have been working in one imension only The main reason for this is to become familiar with the main physical ieas like Newton s secon law, without the aitional complication

More information

Lecture XII. where Φ is called the potential function. Let us introduce spherical coordinates defined through the relations

Lecture XII. where Φ is called the potential function. Let us introduce spherical coordinates defined through the relations Lecture XII Abstract We introuce the Laplace equation in spherical coorinates an apply the metho of separation of variables to solve it. This will generate three linear orinary secon orer ifferential equations:

More information

Lower Bounds for Local Monotonicity Reconstruction from Transitive-Closure Spanners

Lower Bounds for Local Monotonicity Reconstruction from Transitive-Closure Spanners Lower Bouns for Local Monotonicity Reconstruction from Transitive-Closure Spanners Arnab Bhattacharyya Elena Grigorescu Mahav Jha Kyomin Jung Sofya Raskhonikova Davi P. Wooruff Abstract Given a irecte

More information

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Ashish Goel Michael Kapralov Sanjeev Khanna Abstract We consier the well-stuie problem of fining a perfect matching in -regular bipartite

More information

On the Aloha throughput-fairness tradeoff

On the Aloha throughput-fairness tradeoff On the Aloha throughput-fairness traeoff 1 Nan Xie, Member, IEEE, an Steven Weber, Senior Member, IEEE Abstract arxiv:1605.01557v1 [cs.it] 5 May 2016 A well-known inner boun of the stability region of

More information

How to Minimize Maximum Regret in Repeated Decision-Making

How to Minimize Maximum Regret in Repeated Decision-Making How to Minimize Maximum Regret in Repeate Decision-Making Karl H. Schlag July 3 2003 Economics Department, European University Institute, Via ella Piazzuola 43, 033 Florence, Italy, Tel: 0039-0-4689, email:

More information

Approximate Constraint Satisfaction Requires Large LP Relaxations

Approximate Constraint Satisfaction Requires Large LP Relaxations Approximate Constraint Satisfaction Requires Large LP Relaxations oah Fleming April 19, 2018 Linear programming is a very powerful tool for attacking optimization problems. Techniques such as the ellipsoi

More information

On the Surprising Behavior of Distance Metrics in High Dimensional Space

On the Surprising Behavior of Distance Metrics in High Dimensional Space On the Surprising Behavior of Distance Metrics in High Dimensional Space Charu C. Aggarwal, Alexaner Hinneburg 2, an Daniel A. Keim 2 IBM T. J. Watson Research Center Yortown Heights, NY 0598, USA. charu@watson.ibm.com

More information

6 General properties of an autonomous system of two first order ODE

6 General properties of an autonomous system of two first order ODE 6 General properties of an autonomous system of two first orer ODE Here we embark on stuying the autonomous system of two first orer ifferential equations of the form ẋ 1 = f 1 (, x 2 ), ẋ 2 = f 2 (, x

More information

arxiv:hep-th/ v1 3 Feb 1993

arxiv:hep-th/ v1 3 Feb 1993 NBI-HE-9-89 PAR LPTHE 9-49 FTUAM 9-44 November 99 Matrix moel calculations beyon the spherical limit arxiv:hep-th/93004v 3 Feb 993 J. Ambjørn The Niels Bohr Institute Blegamsvej 7, DK-00 Copenhagen Ø,

More information

Permanent vs. Determinant

Permanent vs. Determinant Permanent vs. Determinant Frank Ban Introuction A major problem in theoretical computer science is the Permanent vs. Determinant problem. It asks: given an n by n matrix of ineterminates A = (a i,j ) an

More information

A Review of Multiple Try MCMC algorithms for Signal Processing

A Review of Multiple Try MCMC algorithms for Signal Processing A Review of Multiple Try MCMC algorithms for Signal Processing Luca Martino Image Processing Lab., Universitat e València (Spain) Universia Carlos III e Mari, Leganes (Spain) Abstract Many applications

More information

PETER L. BARTLETT AND MARTEN H. WEGKAMP

PETER L. BARTLETT AND MARTEN H. WEGKAMP CLASSIFICATION WITH A REJECT OPTION USING A HINGE LOSS PETER L. BARTLETT AND MARTEN H. WEGKAMP Abstract. We consier the problem of binary classification where the classifier can, for a particular cost,

More information

Calculus of Variations

Calculus of Variations 16.323 Lecture 5 Calculus of Variations Calculus of Variations Most books cover this material well, but Kirk Chapter 4 oes a particularly nice job. x(t) x* x*+ αδx (1) x*- αδx (1) αδx (1) αδx (1) t f t

More information