A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio


MASSACHUSETTS INSTITUTE OF TECHNOLOGY
ARTIFICIAL INTELLIGENCE LABORATORY
and
CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING
DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES

A.I. Memo No. 1654, March 23, 1999
C.B.C.L. Paper No. 171

A unified framework for Regularization Networks and Support Vector Machines
Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio

This publication can be retrieved by anonymous ftp to publications.ai.mit.edu. The pathname for this publication is: ai-publications/1500-1999/aim-1654.ps

Abstract

Regularization Networks and Support Vector Machines are techniques for solving certain problems of learning from examples, in particular the regression problem of approximating a multivariate function from sparse data. We present both formulations in a unified framework, namely in the context of Vapnik's theory of statistical learning which provides a general foundation for the learning problem, combining functional analysis and statistics.

Copyright © Massachusetts Institute of Technology, 1998

This report describes research done at the Center for Biological & Computational Learning and the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. This research was sponsored by the National Science Foundation under contract No. IIS-9800032, the Office of Naval Research under contract No. N00014-93-1-0385 and contract No. N00014-95-1-0600. Partial support was also provided by Daimler-Benz AG, Eastman Kodak, Siemens Corporate Research, Inc., ATR and AT&T.

Contents

1 Introduction
2 Overview of statistical learning theory
  2.1 Uniform Convergence and the Vapnik-Chervonenkis bound
  2.2 The method of Structural Risk Minimization
  2.3 ε-uniform convergence and the V_γ dimension
  2.4 Overview of our approach
3 Reproducing Kernel Hilbert Spaces: a brief overview
4 Regularization Networks
  4.1 Radial Basis Functions
  4.2 Regularization, generalized splines and kernel smoothers
  4.3 Dual representation of Regularization Networks
  4.4 From regression to classification
5 Support vector machines
  5.1 SVM in RKHS
  5.2 From regression to classification
6 SRM for RNs and SVMs
  6.1 SRM for SVM Classification
    6.1.1 Distribution dependent bounds for SVMC
7 A Bayesian Interpretation of Regularization and SRM?
  7.1 Maximum A Posteriori Interpretation of Regularization
  7.2 Bayesian interpretation of the stabilizer in the RN and SVM functionals
  7.3 Bayesian interpretation of the data term in the Regularization and SVM functional
  7.4 Why a MAP interpretation may be misleading
8 Connections between SVMs and Sparse Approximation techniques
  8.1 The problem of sparsity
  8.2 Equivalence between BPDN and SVMs
  8.3 Independent Component Analysis
9 Remarks
  9.1 Regularization Networks can implement SRM
  9.2 The SVM functional is a special formulation of regularization
  9.3 SVM, sparsity and compression
  9.4 Gaussian processes, regularization and SVM
  9.5 Kernels and how to choose an input representation
  9.6 Capacity control and the physical world
A Regularization Theory for Learning
B An example of RKHS

C Regularized Solutions in RKHS
D Relation between SVMC and SVMR
E Proof of the theorem 6.2
F The noise model of the data term in SVMR

1 Introduction

The purpose of this paper is to present a theoretical framework for the problem of learning from examples. Learning from examples can be regarded as the regression problem of approximating a multivariate function from sparse data and we will take this point of view here¹. The problem of approximating a function from sparse data is ill-posed and a classical way to solve it is regularization theory [92, 10, 11]. Classical regularization theory, as we will consider here², formulates the regression problem as a variational problem of finding the function f that minimizes the functional

$$ \min_{f \in \mathcal{H}} H[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} (y_i - f(x_i))^2 + \lambda \|f\|_K^2 \qquad (1) $$

where $\|f\|_K^2$ is a norm in a Reproducing Kernel Hilbert Space $\mathcal{H}$ defined by the positive definite function K, ℓ is the number of data points or examples (the pairs $(x_i, y_i)$) and λ is the regularization parameter (see the seminal work of [102]). Under rather general conditions the solution of equation (1) is

$$ f(x) = \sum_{i=1}^{\ell} c_i K(x, x_i). \qquad (2) $$

Until now the functionals of classical regularization have lacked a rigorous justification for a finite set of training data. Their formulation is based on functional analysis arguments which rely on asymptotic results and do not consider finite data sets³. Regularization is the approach we have taken in earlier work on learning [69, 39, 77]. The seminal work of Vapnik [94, 95, 96] has now set the foundations for a more general theory that justifies regularization functionals for learning from finite sets and can be used to extend considerably the classical framework of regularization, effectively marrying a functional analysis perspective with modern advances in the theory of probability and statistics. The basic idea of Vapnik's theory is closely related to regularization: for a finite set of training examples the search for the best model or approximating function has to be constrained to an appropriately "small" hypothesis space (which can also be thought of as a space of machines or models or network architectures). If the space is too large, models can be found which will fit exactly the data but will have a poor generalization performance, that is poor predictive capability on new data. Vapnik's theory characterizes and formalizes these concepts in terms of the capacity of a set of functions and capacity control depending on the training data: for instance, for a small training set the capacity of the function space in which f is sought has to be small whereas it can increase with a larger training set. As we will see later in the case of regularization, a form of capacity control leads to choosing an optimal λ in equation (1) for a given set of data. A key part of the theory is to define and bound the capacity of a set of functions.

¹ There is a large literature on the subject: useful reviews are [44, 19, 102, 39], [96] and references therein.
² The general regularization scheme for learning is sketched in Appendix A.
³ The method of quasi-solutions of Ivanov and the equivalent Tikhonov's regularization technique were developed to solve ill-posed problems of the type $Af = F$, where A is a (linear) operator, f is the desired solution in a metric space $E_1$, and F are the data in a metric space $E_2$.

Thus the key and somewhat novel theme of this review is a) to describe a unified framework for several learning techniques for finite training sets and b) to justify them in terms of statistical learning theory. We will consider functionals of the form

$$ H[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda \|f\|_K^2, \qquad (3) $$

where $V(\cdot,\cdot)$ is a loss function. We will describe how classical regularization and Support Vector Machines [96] for both regression (SVMR) and classification (SVMC) correspond to the minimization of H in equation (3) for different choices of V:

Classical (L2) Regularization Networks (RN)

$$ V(y_i, f(x_i)) = (y_i - f(x_i))^2 \qquad (4) $$

Support Vector Machines Regression (SVMR)

$$ V(y_i, f(x_i)) = |y_i - f(x_i)|_{\epsilon} \qquad (5) $$

Support Vector Machines Classification (SVMC)

$$ V(y_i, f(x_i)) = |1 - y_i f(x_i)|_{+} \qquad (6) $$

where $|\cdot|_{\epsilon}$ is Vapnik's epsilon-insensitive norm (see later), $|x|_{+} = x$ if x is positive and zero otherwise, and $y_i$ is a real number in RN and SVMR, whereas it takes values −1, 1 in SVMC. Loss function (6) is also called the soft margin loss function. For SVMC, we will also discuss two other loss functions:

The hard margin loss function:

$$ V(y_i, f(x_i)) = \theta(1 - y_i f(x_i)) \qquad (7) $$

The misclassification loss function:

$$ V(y_i, f(x_i)) = \theta(-y_i f(x_i)) \qquad (8) $$

where $\theta(\cdot)$ is the Heaviside function. For classification one should minimize (8) (or (7)), but in practice other loss functions, such as the soft margin one (6) [22, 95], are used. We discuss this issue further in section 6.

The minimizer of (3) using the three loss functions has the same general form (2) (or $f(x) = \sum_{i=1}^{\ell} c_i K(x, x_i) + b$, see later) but interestingly different properties⁴.

⁴ For general differentiable loss functions V the form of the solution is still the same, as shown in Appendix C.

In this review we will show how different learning techniques based on the minimization of functionals of the form of H in (3) can be justified for a few choices of $V(\cdot,\cdot)$ using a slight extension of the tools and results of Vapnik's statistical learning theory. In section 2 we outline the main results in the theory of statistical learning and in particular Structural Risk Minimization, the technique suggested by Vapnik to solve the problem of capacity control in learning from "small" training sets. At the end of the section we will outline a technical extension of Vapnik's Structural Risk Minimization framework (SRM). With this extension both RN and Support Vector Machines (SVMs) can be seen within a SRM scheme. In recent years a number of papers claim that SVM cannot be

justified in a data-independent SRM framework (i.e. [86]). One of the goals of this paper is to provide such a data-independent SRM framework that justifies SVM as well as RN.

Before describing regularization techniques, section 3 reviews some basic facts on RKHS which are the main function spaces on which this review is focused. After the section on regularization (section 4) we will describe SVMs (section 5). As we saw already, SVMs for regression can be considered as a modification of regularization formulations of the type of equation (1). Radial Basis Functions (RBF) can be shown to be solutions in both cases (for radial K) but with a rather different structure of the coefficients $c_i$. Section 6 describes in more detail how and why both RN and SVM can be justified in terms of SRM, in the sense of Vapnik's theory: the key to capacity control is how to choose λ for a given set of data. Section 7 describes a naive Bayesian Maximum A Posteriori (MAP) interpretation of RNs and of SVMs. It also shows why a formal MAP interpretation, though interesting and even useful, may be somewhat misleading. Section 8 discusses relations of the regularization and SVM techniques with other representations of functions and signals such as sparse representations from overcomplete dictionaries, Blind Source Separation, and Independent Component Analysis. Finally, section 9 summarizes the main themes of the review and discusses some of the open problems.

2 Overview of statistical learning theory

We consider the case of learning from examples as defined in the statistical learning theory framework [94, 95, 96]. We have two sets of variables $x \in X \subseteq R^d$ and $y \in Y \subseteq R$ that are related by a probabilistic relationship. We say that the relationship is probabilistic because generally an element of X does not determine uniquely an element of Y, but rather a probability distribution on Y. This can be formalized assuming that a probability distribution $P(x, y)$ is defined over the set $X \times Y$. The probability distribution $P(x, y)$ is unknown, and under very general conditions can be written as $P(x, y) = P(x) P(y|x)$ where $P(y|x)$ is the conditional probability of y given x, and $P(x)$ is the marginal probability of x. We are provided with examples of this probabilistic relationship, that is with a data set $D_\ell \equiv \{(x_i, y_i) \in X \times Y\}_{i=1}^{\ell}$ called the training data, obtained by sampling ℓ times the set $X \times Y$ according to $P(x, y)$. The problem of learning consists in, given the data set $D_\ell$, providing an estimator, that is a function $f: X \to Y$, that can be used, given any value of $x \in X$, to predict a value y.

In statistical learning theory, the standard way to solve the learning problem consists in defining a risk functional, which measures the average amount of error associated with an estimator, and then looking for the estimator, among the allowed ones, with the lowest risk. If $V(y, f(x))$ is the loss function measuring the error we make when we predict y by $f(x)$⁵, then the average error is the so called expected risk:

$$ I[f] \equiv \int_{X,Y} V(y, f(x)) P(x, y)\, dx\, dy \qquad (9) $$

We assume that the expected risk is defined on a large class of functions $\mathcal{F}$ and we will denote by $f_0$ the function which minimizes the expected risk in $\mathcal{F}$:

$$ f_0(x) = \arg\min_{f \in \mathcal{F}} I[f] \qquad (10) $$

⁵ Typically for regression the loss function is of the form $V(y - f(x))$.
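For concreteness, the loss functions (4)-(8) introduced above can be written as a few lines of code. The sketch below is only an illustration using NumPy; the function names and the example value of ε are our own choices, not the paper's.

```python
import numpy as np

def square_loss(y, fx):
    # (y - f(x))^2, the classical L2 regularization data term, eq. (4)
    return (y - fx) ** 2

def eps_insensitive_loss(y, fx, eps=0.1):
    # |y - f(x)|_eps, Vapnik's epsilon-insensitive loss for SVM regression, eq. (5)
    return np.maximum(np.abs(y - fx) - eps, 0.0)

def soft_margin_loss(y, fx):
    # |1 - y f(x)|_+, the soft margin (hinge) loss for SVM classification, eq. (6)
    return np.maximum(1.0 - y * fx, 0.0)

def hard_margin_loss(y, fx):
    # theta(1 - y f(x)), eq. (7)
    return (1.0 - y * fx > 0).astype(float)

def misclassification_loss(y, fx):
    # theta(-y f(x)), eq. (8)
    return (-y * fx > 0).astype(float)

# tiny usage example with made-up labels and predictions
y_cls = np.array([1.0, -1.0, 1.0])
f_cls = np.array([0.8, -0.2, -0.5])
print(soft_margin_loss(y_cls, f_cls))        # [0.2 0.8 1.5]
print(misclassification_loss(y_cls, f_cls))  # [0. 0. 1.]
```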

The function $f_0$ is our ideal estimator, and it is often called the target function⁶. Unfortunately this function cannot be found in practice, because the probability distribution $P(x, y)$ that defines the expected risk is unknown, and only a sample of it, the data set $D_\ell$, is available. To overcome this shortcoming we need an induction principle that we can use to "learn" from the limited number of training data we have. Statistical learning theory as developed by Vapnik builds on the so-called empirical risk minimization (ERM) induction principle. The ERM method consists in using the data set $D_\ell$ to build a stochastic approximation of the expected risk, which is usually called the empirical risk, and is defined as⁷:

$$ I_{emp}[f; \ell] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)). \qquad (11) $$

The central question of the theory is whether the expected risk of the minimizer of the empirical risk in $\mathcal{F}$ is close to the expected risk of $f_0$. Notice that the question is not necessarily whether we can find $f_0$ but whether we can imitate $f_0$ in the sense that the expected risk of our solution is close to that of $f_0$. Formally the theory answers the question of finding under which conditions the method of ERM satisfies:

$$ \lim_{\ell \to \infty} I_{emp}[\hat{f}_\ell; \ell] = \lim_{\ell \to \infty} I[\hat{f}_\ell] = I[f_0] \qquad (12) $$

in probability (all statements are probabilistic since we start with $P(x, y)$ on the data), where we denote by $\hat{f}_\ell$ the minimizer of the empirical risk (11) in $\mathcal{F}$.

It can be shown (see for example [96]) that in order for the limits in eq. (12) to hold true in probability, or more precisely, for the empirical risk minimization principle to be non-trivially consistent (see [96] for a discussion about consistency versus non-trivial consistency), the following uniform law of large numbers (which "translates to" one-sided uniform convergence in probability of empirical risk to expected risk in $\mathcal{F}$) is a necessary and sufficient condition:

$$ \lim_{\ell \to \infty} P\left\{ \sup_{f \in \mathcal{F}} \left( I[f] - I_{emp}[f; \ell] \right) > \epsilon \right\} = 0 \quad \forall \epsilon > 0 \qquad (13) $$

Intuitively, if $\mathcal{F}$ is very "large" then we can always find $\hat{f}_\ell \in \mathcal{F}$ with 0 empirical error. This however does not guarantee that the expected risk of $\hat{f}_\ell$ is also close to 0, or close to $I[f_0]$.

Typically in the literature the two-sided uniform convergence in probability:

$$ \lim_{\ell \to \infty} P\left\{ \sup_{f \in \mathcal{F}} \left| I[f] - I_{emp}[f; \ell] \right| > \epsilon \right\} = 0 \quad \forall \epsilon > 0 \qquad (14) $$

is considered, which clearly implies (13). In this paper we focus on the stronger two-sided case and note that one can get one-sided uniform convergence with some minor technical changes to the theory. We will not discuss the technical issues involved in the relations between consistency, non-trivial consistency, two-sided and one-sided uniform convergence (a discussion can be found in [96]), and from now on we concentrate on the two-sided uniform convergence in probability, which we simply refer to as uniform convergence.

The theory of uniform convergence of ERM has been developed in [97, 98, 99, 94, 96]. It has also been studied in the context of empirical processes [29, 74, 30]. Here we summarize the main results of the theory.

⁶ In the case that V is $(y - f(x))^2$, the minimizer of eq. (10) is the regression function $f_0(x) = \int y P(y|x)\, dy$.
⁷ It is important to notice that the data terms (4), (5) and (6) are used for the empirical risks $I_{emp}$.
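As a quick numerical illustration of the relation between the empirical risk (11) and the expected risk (9): for a fixed estimator f and a known synthetic distribution P(x, y), the empirical risk on ℓ samples fluctuates around a Monte Carlo estimate of I[f]. The distribution, the estimator and all names below are assumptions made only for this example, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # synthetic P(x, y): x uniform on [0, 1], y = sin(2 pi x) + Gaussian noise
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(n)
    return x, y

def f(x):
    # an arbitrary fixed estimator
    return 1.0 - 2.0 * x

def emp_risk(x, y):
    # I_emp[f; l] with the square loss, eq. (11)
    return np.mean((y - f(x)) ** 2)

# Monte Carlo approximation of the expected risk I[f], eq. (9)
I_f = emp_risk(*sample(500_000))
print("I[f] (Monte Carlo) ~", round(I_f, 3))

# empirical risks on finite training sets scatter around I[f]
for ell in (10, 100, 10_000):
    print(ell, round(emp_risk(*sample(ell)), 3))
```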

2.1 Uniform Convergence and the Vapnik-Chervonenkis bound

Vapnik and Chervonenkis [97, 98] studied under what conditions uniform convergence of the empirical risk to expected risk takes place. The results are formulated in terms of three important quantities that measure the complexity of a set of functions: the VC entropy, the annealed VC entropy, and the growth function. We begin with the definitions of these quantities. First we define the minimal ε-net of a set, which intuitively measures the "cardinality" of a set at "resolution" ε:

Definition 2.1 Let A be a set in a metric space $\mathcal{A}$ with distance metric d. For a fixed ε > 0, the set $B \subseteq \mathcal{A}$ is called an ε-net of A in $\mathcal{A}$, if for any point $a \in A$ there is a point $b \in B$ such that $d(a, b) < \epsilon$. We say that the set B is a minimal ε-net of A in $\mathcal{A}$, if it is finite and contains the minimal number of elements.

Given a training set $D_\ell = \{(x_i, y_i) \in X \times Y\}_{i=1}^{\ell}$, consider the set of ℓ-dimensional vectors:

$$ q(f) = (V(y_1, f(x_1)), \ldots, V(y_\ell, f(x_\ell))) \qquad (15) $$

with $f \in \mathcal{F}$, and define the number of elements of the minimal ε-net of this set under the metric:

$$ d(q(f), q(f')) = \max_{1 \le i \le \ell} \left| V(y_i, f(x_i)) - V(y_i, f'(x_i)) \right| $$

to be $\mathcal{N}^{\mathcal{F}}(\epsilon; D_\ell)$ (which clearly depends both on $\mathcal{F}$ and on the loss function V). Intuitively this quantity measures how many different functions we effectively have at "resolution" ε, when we only care about the values of the functions at points in $D_\ell$. Using this quantity we now give the following definitions:

Definition 2.2 Given a set $X \times Y$ and a probability $P(x, y)$ defined over it, the VC entropy of a set of functions $V(y, f(x))$, $f \in \mathcal{F}$, on a data set of size ℓ is defined as:

$$ H^{\mathcal{F}}(\epsilon; \ell) \equiv \int_{X,Y} \ln \mathcal{N}^{\mathcal{F}}(\epsilon; D_\ell) \prod_{i=1}^{\ell} P(x_i, y_i)\, dx_i\, dy_i $$

Definition 2.3 Given a set $X \times Y$ and a probability $P(x, y)$ defined over it, the annealed VC entropy of a set of functions $V(y, f(x))$, $f \in \mathcal{F}$, on a data set of size ℓ is defined as:

$$ H^{\mathcal{F}}_{ann}(\epsilon; \ell) \equiv \ln \int_{X,Y} \mathcal{N}^{\mathcal{F}}(\epsilon; D_\ell) \prod_{i=1}^{\ell} P(x_i, y_i)\, dx_i\, dy_i $$

Definition 2.4 Given a set $X \times Y$, the growth function of a set of functions $V(y, f(x))$, $f \in \mathcal{F}$, on a data set of size ℓ is defined as:

$$ G^{\mathcal{F}}(\epsilon; \ell) \equiv \ln \left( \sup_{D_\ell \in (X \times Y)^{\ell}} \mathcal{N}^{\mathcal{F}}(\epsilon; D_\ell) \right) $$
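For a finite pool of candidate functions, $\mathcal{N}^{\mathcal{F}}(\epsilon; D_\ell)$ can be probed directly by building an ε-net of the loss vectors (15) under the max metric. The greedy construction below only gives an upper bound on the minimal net size (greedy nets need not be minimal), and the candidate family, loss and data are invented for the illustration.

```python
import numpy as np

def loss_vectors(preds, y):
    # q(f) = (V(y_1, f(x_1)), ..., V(y_l, f(x_l))) with the square loss, eq. (15)
    return (y[None, :] - preds) ** 2

def greedy_eps_net_size(Q, eps):
    # size of a greedily built eps-net of the rows of Q under the max (l_inf) metric;
    # an upper bound on the minimal eps-net size N^F(eps; D_l)
    centers = []
    for q in Q:
        if not any(np.max(np.abs(q - c)) < eps for c in centers):
            centers.append(q)
    return len(centers)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 10)                      # l = 10 data points
y = np.sin(np.pi * x)
slopes = np.linspace(-2, 2, 201)                # finite family F = {f_w(x) = w * x}
preds = slopes[:, None] * x[None, :]            # each row holds (f_w(x_1), ..., f_w(x_l))

Q = loss_vectors(preds, y)
for eps in (0.5, 0.1, 0.02):
    print(eps, greedy_eps_net_size(Q, eps))     # coarser resolution -> fewer "distinct" functions
```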

Notice that all three quantities are functions of the number of data ℓ and of ε, and that clearly:

$$ H^{\mathcal{F}}(\epsilon; \ell) \le H^{\mathcal{F}}_{ann}(\epsilon; \ell) \le G^{\mathcal{F}}(\epsilon; \ell). $$

These definitions can easily be extended in the case of indicator functions, i.e. functions taking binary values⁸ such as {−1, 1}, in which case the three quantities do not depend on ε for ε < 1, since the vectors (15) are all at the vertices of the hypercube $\{0, 1\}^{\ell}$.

Using these definitions we can now state three important results of statistical learning theory [96]. For a given probability distribution $P(x, y)$:

1. The necessary and sufficient condition for uniform convergence is that

$$ \lim_{\ell \to \infty} \frac{H^{\mathcal{F}}(\epsilon; \ell)}{\ell} = 0 \quad \forall \epsilon > 0 $$

2. A sufficient condition for fast asymptotic rate of convergence⁹ is that

$$ \lim_{\ell \to \infty} \frac{H^{\mathcal{F}}_{ann}(\epsilon; \ell)}{\ell} = 0 \quad \forall \epsilon > 0 $$

It is an open question whether this is also a necessary condition.

3. A sufficient condition for distribution independent (that is, for any $P(x, y)$) fast rate of convergence is that

$$ \lim_{\ell \to \infty} \frac{G^{\mathcal{F}}(\epsilon; \ell)}{\ell} = 0 \quad \forall \epsilon > 0 $$

For indicator functions this is also a necessary condition.

According to statistical learning theory, these three quantities are what one should consider when designing and analyzing learning machines: the VC-entropy and the annealed VC-entropy for an analysis which depends on the probability distribution $P(x, y)$ of the data, and the growth function for a distribution independent analysis. In this paper we consider only distribution independent results, although the reader should keep in mind that distribution dependent results are likely to be important in the future.

Unfortunately the growth function of a set of functions is difficult to compute in practice. So the standard approach in statistical learning theory is to use an upper bound on the growth function which is given using another important quantity, the VC-dimension, which is another (looser) measure of the complexity, capacity, of a set of functions. In this paper we concentrate on this quantity, but it is important that the reader keeps in mind that the VC-dimension is in a sense a "weak" measure of complexity of a set of functions, so it typically leads to loose upper bounds on the growth function: in general one is better off, theoretically, using directly the growth function. We now discuss the VC-dimension and its implications for learning.

The VC-dimension was first defined for the case of indicator functions and then was extended to real valued functions.

⁸ In the case of indicator functions, y is binary, and V is 0 for $f(x) = y$, 1 otherwise.
⁹ This means that for ℓ large enough we have that $P\{\sup_{f \in \mathcal{F}} |I[f] - I_{emp}[f]| > \epsilon\} < e^{-c\epsilon^2 \ell}$ for some constant c > 0. Intuitively, fast rate is typically needed in practice.
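To give a feel for the growth function in the indicator-function case, the sketch below counts the distinct dichotomies that one-dimensional threshold classifiers realize on ℓ points and prints the ratio of the log of that count to ℓ, which shrinks as ℓ grows, consistent with the distribution independent condition above. The class of thresholds and the uniform data are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def num_dichotomies(x):
    # distinct labelings of the points x realized by {sign(x - t): t real},
    # together with their sign-flipped versions
    xs = np.sort(x)
    thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
    labelings = set()
    for t in thresholds:
        lab = (x > t).astype(int)
        labelings.add(tuple(lab))
        labelings.add(tuple(1 - lab))
    return len(labelings)

for ell in (10, 100, 1000, 10000):
    x = rng.uniform(-1.0, 1.0, ell)
    n = num_dichotomies(x)          # grows only linearly with ell (finite VC dimension)
    print(ell, n, round(np.log(n) / ell, 4))
```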

Definition 2.5 The VC-dimension of a set $\{\theta(f(x)), f \in \mathcal{F}\}$ of indicator functions is the maximum number h of vectors $x_1, \ldots, x_h$ that can be separated into two classes in all $2^h$ possible ways using functions of the set. If, for any number N, it is possible to find N points $x_1, \ldots, x_N$ that can be separated in all the $2^N$ possible ways, we will say that the VC-dimension of the set is infinite.

The remarkable property of this quantity is that, although as we mentioned the VC-dimension only provides an upper bound to the growth function, in the case of indicator functions, finiteness of the VC-dimension is a necessary and sufficient condition for uniform convergence (eq. (14)) independent of the underlying distribution $P(x, y)$.

Definition 2.6 Let $A \le V(y, f(x)) \le B$, $f \in \mathcal{F}$, with A and B < ∞. The VC-dimension of the set $\{V(y, f(x)), f \in \mathcal{F}\}$ is defined as the VC-dimension of the set of indicator functions $\{\theta(V(y, f(x)) - \alpha), \alpha \in (A, B)\}$.

Sometimes we refer to the VC-dimension of $\{V(y, f(x)), f \in \mathcal{F}\}$ as the VC dimension of V in $\mathcal{F}$. It can be easily shown that for $y \in \{-1, +1\}$ and for $V(y, f(x)) = \theta(-yf(x))$ as the loss function, the VC dimension of V in $\mathcal{F}$ computed using definition 2.6 is equal to the VC dimension of the set of indicator functions $\{\theta(f(x)), f \in \mathcal{F}\}$ computed using definition 2.5. In the case of real valued functions, finiteness of the VC-dimension is only sufficient for uniform convergence. Later in this section we will discuss a measure of capacity that provides also necessary conditions.

An important outcome of the work of Vapnik and Chervonenkis is that the uniform deviation between empirical risk and expected risk in a hypothesis space can be bounded in terms of the VC-dimension, as shown in the following theorem:

Theorem 2.1 (Vapnik and Chervonenkis 1971) Let $A \le V(y, f(x)) \le B$, $f \in \mathcal{F}$, $\mathcal{F}$ be a set of bounded functions and h the VC-dimension of V in $\mathcal{F}$. Then, with probability at least 1 − η, the following inequality holds simultaneously for all the elements f of $\mathcal{F}$:

$$ I_{emp}[f; \ell] - (B - A)\sqrt{\frac{h \ln\frac{2e\ell}{h} - \ln\frac{\eta}{4}}{\ell}} \;\le\; I[f] \;\le\; I_{emp}[f; \ell] + (B - A)\sqrt{\frac{h \ln\frac{2e\ell}{h} - \ln\frac{\eta}{4}}{\ell}} \qquad (16) $$

The quantity $|I[f] - I_{emp}[f; \ell]|$ is often called estimation error, and bounds of the type above are usually called VC bounds¹⁰. From eq. (16) it is easy to see that with probability at least 1 − η:

$$ I[\hat{f}_\ell] - 2(B - A)\sqrt{\frac{h \ln\frac{2e\ell}{h} - \ln\frac{\eta}{4}}{\ell}} \;\le\; I[f_0] \;\le\; I[\hat{f}_\ell] + 2(B - A)\sqrt{\frac{h \ln\frac{2e\ell}{h} - \ln\frac{\eta}{4}}{\ell}} \qquad (17) $$

where $\hat{f}_\ell$ is, as in (12), the minimizer of the empirical risk in $\mathcal{F}$.

¹⁰ It is important to note that bounds on the expected risk using the annealed VC-entropy also exist. These are tighter than the VC-dimension ones.

A very interesting feature of inequalities (16) and (17) is that they are non-asymptotic, meaning that they hold for any finite number of data points ℓ, and that the error bounds do not necessarily depend on the dimensionality of the variable x.

Observe that theorem (2.1) and inequality (17) are meaningful in practice only if the VC-dimension of the loss function V in $\mathcal{F}$ is finite and less than ℓ. Since the space $\mathcal{F}$ where the

loss function V is defined is usually very large (i.e. all functions in $L_2$), one typically considers smaller hypothesis spaces $\mathcal{H}$. The cost associated with restricting the space is called the approximation error (see below). In the literature, the space $\mathcal{F}$ where V is defined is called the target space, while $\mathcal{H}$ is what is called the hypothesis space. Of course, all the definitions and analysis above still hold for $\mathcal{H}$, where we replace $f_0$ with the minimizer of the expected risk in $\mathcal{H}$, $\hat{f}_\ell$ is now the minimizer of the empirical risk in $\mathcal{H}$, and h the VC-dimension of the loss function V in $\mathcal{H}$. Inequalities (16) and (17) suggest a method for achieving good generalization: not only minimize the empirical risk, but instead minimize a combination of the empirical risk and the complexity of the hypothesis space. This observation leads us to the method of Structural Risk Minimization that we describe next.

2.2 The method of Structural Risk Minimization

The idea of SRM is to define a nested sequence of hypothesis spaces $H_1 \subset H_2 \subset \ldots \subset H_{n(\ell)}$ with $n(\ell)$ a non-decreasing integer function of ℓ, where each hypothesis space $H_i$ has VC-dimension finite and larger than that of all previous sets, i.e. if $h_i$ is the VC-dimension of space $H_i$, then $h_1 \le h_2 \le \ldots \le h_{n(\ell)}$. For example $H_i$ could be the set of polynomials of degree i, or a set of splines with i nodes, or some more complicated nonlinear parameterization. For each element $H_i$ of the structure the solution of the learning problem is:

$$ \hat{f}_{i,\ell} = \arg\min_{f \in H_i} I_{emp}[f; \ell] \qquad (18) $$

Because of the way we define our structure it should be clear that the larger i is the smaller the empirical error of $\hat{f}_{i,\ell}$ is (since we have greater "flexibility" to fit our training data), but the larger the VC-dimension part (second term) of the right hand side of (16) is. Using such a nested sequence of more and more complex hypothesis spaces, the SRM learning technique consists of choosing the space $H_{n^*(\ell)}$ for which the right hand side of inequality (16) is minimized. It can be shown [94] that for the chosen solution $\hat{f}_{n^*(\ell),\ell}$, inequalities (16) and (17) hold with probability at least $(1 - \eta)^{n(\ell)} \approx 1 - n(\ell)\eta$ ¹¹, where we replace h with $h_{n^*(\ell)}$, $f_0$ with the minimizer of the expected risk in $H_{n^*(\ell)}$, namely $f_{n^*(\ell)}$, and $\hat{f}_\ell$ with $\hat{f}_{n^*(\ell),\ell}$.

With an appropriate choice of $n(\ell)$ ¹² it can be shown that as $\ell \to \infty$ and $n(\ell) \to \infty$, the expected risk of the solution of the method approaches in probability the minimum of the expected risk in $\mathcal{H} = \bigcup_{i} H_i$, namely $I[f_{\mathcal{H}}]$. Moreover, if the target function $f_0$ belongs to the closure of $\mathcal{H}$, then eq. (12) holds in probability (see for example [96]). However, in practice ℓ is finite ("small"), so $n(\ell)$ is small which means that $\mathcal{H} = \bigcup_{i=1}^{n(\ell)} H_i$ is a small space. Therefore $I[f_{\mathcal{H}}]$ may be much larger than the expected risk of our target function $f_0$, since $f_0$ may not be in $\mathcal{H}$. The distance between $I[f_{\mathcal{H}}]$ and $I[f_0]$ is called the approximation error and can be bounded using results from approximation theory. We do not discuss these results here and refer the reader to [54, 26].

¹¹ We want (16) to hold simultaneously for all spaces $H_i$, since we choose the best $\hat{f}_{i,\ell}$.
¹² Various cases are discussed in [27], i.e. $n(\ell) = \ell$.

2.3 ε-uniform convergence and the V_γ dimension

As mentioned above finiteness of the VC-dimension is not a necessary condition for uniform convergence in the case of real valued functions. To get a necessary condition we need a slight

extension of the VC-dimension that has been developed (among others) in [50, 2], known as the $V_\gamma$ dimension¹³. Here we summarize the main results of that theory that we will also use later on to design regression machines for which we will have distribution independent uniform convergence.

Definition 2.7 Let $A \le V(y, f(x)) \le B$, $f \in \mathcal{F}$, with A and B < ∞. The $V_\gamma$-dimension of V in $\mathcal{F}$ (of the set $\{V(y, f(x)), f \in \mathcal{F}\}$) is defined as the maximum number h of vectors $(x_1, y_1), \ldots, (x_h, y_h)$ that can be separated into two classes in all $2^h$ possible ways using rules:

class 1 if: $V(y_i, f(x_i)) \ge s + \gamma$
class 0 if: $V(y_i, f(x_i)) \le s - \gamma$

for $f \in \mathcal{F}$ and some $s \ge 0$. If, for any number N, it is possible to find N points $(x_1, y_1), \ldots, (x_N, y_N)$ that can be separated in all the $2^N$ possible ways, we will say that the $V_\gamma$-dimension of V in $\mathcal{F}$ is infinite.

Notice that for γ = 0 this definition becomes the same as definition 2.6 for VC-dimension. Intuitively, for γ > 0 the "rule" for separating points is more restrictive than the rule in the case γ = 0. It requires that there is a "margin" between the points: points for which $V(y, f(x))$ is between s + γ and s − γ are not classified. As a consequence, the $V_\gamma$ dimension is a decreasing function of γ and in particular is smaller than the VC-dimension.

If V is an indicator function, say $\theta(-yf(x))$, then for any γ definition 2.7 reduces to that of the VC-dimension of a set of indicator functions.

Generalizing slightly the definition of eq. (14) we will say that for a given ε > 0 the ERM method converges ε-uniformly in $\mathcal{F}$ in probability (or that there is ε-uniform convergence) if:

$$ \lim_{\ell \to \infty} P\left\{ \sup_{f \in \mathcal{F}} \left| I_{emp}[f; \ell] - I[f] \right| > \epsilon \right\} = 0. \qquad (19) $$

Notice that if eq. (19) holds for every ε > 0 we have uniform convergence (eq. (14)). It can be shown (variation of [96]) that ε-uniform convergence in probability implies that:

$$ I[\hat{f}_\ell] \le I[f_0] + 2\epsilon \qquad (20) $$

in probability, where, as before, $\hat{f}_\ell$ is the minimizer of the empirical risk and $f_0$ is the minimizer of the expected risk in $\mathcal{F}$ ¹⁴.

The basic theorems for the $V_\gamma$-dimension are the following:

Theorem 2.2 (Alon et al., 1993) Let $A \le V(y, f(x)) \le B$, $f \in \mathcal{F}$, $\mathcal{F}$ be a set of bounded functions. For any ε > 0, if the $V_\gamma$ dimension of V in $\mathcal{F}$ is finite for $\gamma = \alpha\epsilon$ for some constant $\alpha \le \frac{1}{48}$, then the ERM method ε-converges in probability.

Theorem 2.3 (Alon et al., 1993) Let $A \le V(y, f(x)) \le B$, $f \in \mathcal{F}$, $\mathcal{F}$ be a set of bounded functions. The ERM method uniformly converges (in probability) if and only if the $V_\gamma$ dimension of V in $\mathcal{F}$ is finite for every γ > 0.

So finiteness of the $V_\gamma$ dimension for every γ > 0 is a necessary and sufficient condition for distribution independent uniform convergence of the ERM method for real-valued functions.

¹³ In the literature, other quantities, such as the fat-shattering dimension and the $P_\gamma$ dimension, are also defined. They are closely related to each other, and are essentially equivalent to the $V_\gamma$ dimension for the purpose of this paper. The reader can refer to [2, 7] for an in-depth discussion on this topic.
¹⁴ This is like ε-learnability in the PAC model [93].
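Putting the pieces of sections 2.1 and 2.2 together: the square-root term in the VC bound (16) can be evaluated numerically, and SRM (eq. (18) and the surrounding discussion) then amounts to picking the element of a nested structure that minimizes empirical risk plus this capacity term. The sketch below does this for nested spaces of polynomials of increasing degree; using degree + 1 as a stand-in for the relevant VC-dimension, the choices B − A = 1 and η = 0.05, and the synthetic data are all assumptions made for the illustration, not statements of the paper.

```python
import numpy as np

def vc_term(ell, h, eta=0.05, B_minus_A=1.0):
    # the deviation term of eq. (16): (B - A) * sqrt( (h ln(2 e ell / h) - ln(eta / 4)) / ell )
    return B_minus_A * np.sqrt((h * np.log(2.0 * np.e * ell / h) - np.log(eta / 4.0)) / ell)

rng = np.random.default_rng(3)
ell = 60
x = rng.uniform(-1.0, 1.0, ell)
y = np.sin(2.0 * x) + 0.15 * rng.standard_normal(ell)

best = None
for degree in range(0, 12):                        # H_1 in H_2 in ... : polynomials of growing degree
    coeffs = np.polyfit(x, y, degree)              # empirical risk minimizer in H_i, as in eq. (18)
    emp_risk = np.mean((y - np.polyval(coeffs, x)) ** 2)
    h = degree + 1                                 # capacity proxy for H_i (assumption)
    bound = emp_risk + vc_term(ell, h)             # the SRM trade-off: empirical risk + capacity term
    if best is None or bound < best[0]:
        best = (bound, degree)
    print(degree, round(emp_risk, 4), round(bound, 4))

print("SRM choice: degree", best[1])
```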

Theorem 2.4 (Alon et al., 1993) Let $A \le V(y, f(x)) \le B$, $f \in \mathcal{F}$, $\mathcal{F}$ be a set of bounded functions. For any ε > 0, for all $\ell \ge \frac{2}{\epsilon^2}$ we have that if $h_\gamma$ is the $V_\gamma$ dimension of V in $\mathcal{F}$ for $\gamma = \alpha\epsilon$ ($\alpha \le \frac{1}{48}$), $h_\gamma$ finite, then:

$$ P\left\{ \sup_{f \in \mathcal{F}} \left| I_{emp}[f; \ell] - I[f] \right| > \epsilon \right\} \le \mathcal{G}(\epsilon, \ell, h_\gamma), \qquad (21) $$

where $\mathcal{G}$ is an increasing function of $h_\gamma$ and a decreasing function of ε and ℓ, with $\mathcal{G} \to 0$ as $\ell \to \infty$ ¹⁵.

¹⁵ Closed forms of $\mathcal{G}$ can be derived (see for example [2]) but we do not present them here for simplicity of notation.

From this theorem we can easily see that for any ε > 0, for all $\ell \ge \frac{2}{\epsilon^2}$:

$$ P\left\{ I[\hat{f}_\ell] \le I[f_0] + 2\epsilon \right\} \ge 1 - 2\mathcal{G}(\epsilon, \ell, h_\gamma), \qquad (22) $$

where $\hat{f}_\ell$ is, as before, the minimizer of the empirical risk in $\mathcal{F}$. An important observation to keep in mind is that theorem 2.4 requires the $V_\gamma$ dimension of the loss function V in $\mathcal{F}$. In the case of classification, this implies that if we want to derive bounds on the expected misclassification we have to use the $V_\gamma$ dimension of the loss function $\theta(-yf(x))$ (which is the VC dimension of the set of indicator functions $\{\mathrm{sgn}(f(x)), f \in \mathcal{F}\}$), and not the $V_\gamma$ dimension of the set $\mathcal{F}$.

The theory of the $V_\gamma$ dimension justifies the extended SRM method we describe below. It is important to keep in mind that the method we describe is only of theoretical interest and will only be used later as a theoretical motivation for RN and SVM. It should be clear that all the definitions and analysis above still hold for any hypothesis space $\mathcal{H}$, where we replace $f_0$ with the minimizer of the expected risk in $\mathcal{H}$, $\hat{f}_\ell$ is now the minimizer of the empirical risk in $\mathcal{H}$, and h the VC-dimension of the loss function V in $\mathcal{H}$.

Let ℓ be the number of training data. For a fixed ε > 0 such that $\ell \ge \frac{2}{\epsilon^2}$, let $\gamma = \frac{1}{48}\epsilon$, and consider, as before, a nested sequence of hypothesis spaces $H_1 \subset H_2 \subset \ldots \subset H_{n(\ell,\epsilon)}$, where each hypothesis space $H_i$ has $V_\gamma$-dimension finite and larger than that of all previous sets, i.e. if $h_i$ is the $V_\gamma$-dimension of space $H_i$, then $h_1 \le h_2 \le \ldots \le h_{n(\ell,\epsilon)}$. For each element $H_i$ of the structure consider the solution of the learning problem to be:

$$ \hat{f}_{i,\ell} = \arg\min_{f \in H_i} I_{emp}[f; \ell]. \qquad (23) $$

Because of the way we define our structure the larger i is the smaller the empirical error of $\hat{f}_{i,\ell}$ is (since we have more "flexibility" to fit our training data), but the larger the right hand side of inequality (21) is.

Using such a nested sequence of more and more complex hypothesis spaces, this extended SRM learning technique consists of finding the structure element $H_{n^*(\ell,\epsilon)}$ for which the trade off between empirical error and the right hand side of (21) is optimal. One practical idea is to find numerically for each $H_i$ the "effective" $\epsilon_i$ so that the bound (21) is the same for all $H_i$, and then choose $\hat{f}_{i,\ell}$ for which the sum of the empirical risk and $\epsilon_i$ is minimized.

We conjecture that as $\ell \to \infty$, for appropriate choice of $n(\ell, \epsilon)$ with $n(\ell, \epsilon) \to \infty$ as $\ell \to \infty$, the expected risk of the solution of the method converges in probability to a value less than 2ε away from the minimum expected risk in $\mathcal{H} = \bigcup_{i} H_i$. Notice that we described an SRM method for a fixed ε. If the $V_\gamma$ dimension of $H_i$ is finite for every γ > 0, we can further modify the extended SRM method so that $\epsilon \to 0$ as $\ell \to \infty$. We conjecture that if the target function $f_0$ belongs to the

closure of $\mathcal{H}$, then as $\ell \to \infty$, with appropriate choices of ε, $n(\ell, \epsilon)$ and $n^*(\ell, \epsilon)$ the solution of this SRM method can be proven (as before) to satisfy eq. (12) in probability. Finding appropriate forms of ε, $n(\ell, \epsilon)$ and $n^*(\ell, \epsilon)$ is an open theoretical problem (which we believe to be a technical matter). Again, as in the case of "standard" SRM, in practice ℓ is finite so $\mathcal{H} = \bigcup_{i=1}^{n(\ell,\epsilon)} H_i$ is a small space and the solution of this method may have expected risk much larger than the expected risk of the target function. Approximation theory can be used to bound this difference [61].

The proposed method is difficult to implement in practice since it is difficult to decide the optimal trade off between empirical error and the bound (21). If we had constructive bounds on the deviation between the empirical and the expected risk like that of theorem 2.1 then we could have a practical way of choosing the optimal element of the structure. Unfortunately existing bounds of that type [2, 7] are not tight. So the final choice of the element of the structure may be done in practice using other techniques such as cross-validation [102].

2.4 Overview of our approach

In order to set the stage for the next two sections on regularization and Support Vector Machines, we outline here how we can justify the proper use of the RN and the SVM functionals (see (3)) in the framework of the SRM principles just described.

The basic idea is to define a structure in terms of a nested sequence of hypothesis spaces $H_1 \subset H_2 \subset \ldots \subset H_{n(\ell)}$ with $H_m$ being the set of functions f in the RKHS with:

$$ \|f\|_K \le A_m, \qquad (24) $$

where $A_m$ is a monotonically increasing sequence of positive constants. Following the SRM method outlined above, for each m we will minimize the empirical risk

$$ \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)), $$

subject to the constraint (24). This in turn leads to using the Lagrange multiplier $\lambda_m$ and to minimizing

$$ \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda_m \left( \|f\|_K^2 - A_m^2 \right), $$

with respect to f and maximizing with respect to $\lambda_m \ge 0$ for each element of the structure. We can then choose the optimal $n^*(\ell)$ and the associated $\lambda^*(\ell)$, and get the optimal solution $\hat{f}_{n^*(\ell)}$.

The solution we get using this method is clearly the same as the solution of:

$$ \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda^*(\ell) \|f\|_K^2 \qquad (25) $$

where $\lambda^*(\ell)$ is the optimal Lagrange multiplier corresponding to the optimal element of the structure $A_{n^*(\ell)}$.

Notice that this approach is quite general. In particular it can be applied to classical L2 regularization, to SVM regression, and, as we will see, to SVM classification with the appropriate $V(\cdot, \cdot)$. In section 6 we will describe in detail this approach for the case that the elements of the structure are infinite dimensional RKHS. We have outlined this theoretical method here so that the reader

understands our motivation for reviewing in the next two sections the approximation schemes resulting from the minimization of functionals of the form of equation (25) for three specific choices of the loss function V:

- $V(y, f(x)) = (y - f(x))^2$ for regularization.
- $V(y, f(x)) = |y - f(x)|_{\epsilon}$ for SVM regression.
- $V(y, f(x)) = |1 - yf(x)|_{+}$ for SVM classification.

For SVM classification the loss functions $V(y, f(x)) = \theta(1 - yf(x))$ (hard margin loss function) and $V(y, f(x)) = \theta(-yf(x))$ (misclassification loss function) will also be discussed. First we present an overview of RKHS which are the hypothesis spaces we consider in the paper.

3 Reproducing Kernel Hilbert Spaces: a brief overview

A Reproducing Kernel Hilbert Space (RKHS) [5] is a Hilbert space $\mathcal{H}$ of functions defined over some bounded domain $X \subset R^d$ with the property that, for each $x \in X$, the evaluation functionals $\mathcal{F}_x$ defined as

$$ \mathcal{F}_x[f] = f(x) \quad \forall f \in \mathcal{H} $$

are linear, bounded functionals. The boundedness means that there exists a $U = U_x \in R^{+}$ such that:

$$ |\mathcal{F}_x[f]| = |f(x)| \le U \|f\| $$

for all f in the RKHS.

It can be proved [102] that to every RKHS $\mathcal{H}$ there corresponds a unique positive definite function $K(x, y)$ of two variables in X, called the reproducing kernel of $\mathcal{H}$ (hence the terminology RKHS), that has the following reproducing property:

$$ f(x) = \langle f(y), K(y, x) \rangle_{\mathcal{H}} \quad \forall f \in \mathcal{H}, \qquad (26) $$

where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ denotes the scalar product in $\mathcal{H}$. The function K behaves in $\mathcal{H}$ as the delta function does in $L_2$, although $L_2$ is not a RKHS (the functionals $\mathcal{F}_x$ are clearly not bounded).

To make things clearer we sketch a way to construct a RKHS, which is relevant to our paper. The mathematical details (such as the convergence or not of certain series) can be found in the theory of integral equations [45, 20, 23]. Let us assume that we have a sequence of positive numbers $\lambda_n$ and linearly independent functions $\phi_n(x)$ such that they define a function $K(x, y)$ in the following way¹⁶:

$$ K(x, y) \equiv \sum_{n=0}^{\infty} \lambda_n \phi_n(x) \phi_n(y), \qquad (27) $$

¹⁶ When working with complex functions $\phi_n(x)$ this formula should be replaced with $K(x, y) \equiv \sum_{n=0}^{\infty} \lambda_n \phi_n(x) \phi_n^*(y)$.

where the series is well defined (for example it converges uniformly). A simple calculation shows that the function K defined in eq. (27) is positive definite. Let us now take as our Hilbert space the set of functions of the form:

$$ f(x) = \sum_{n=0}^{\infty} a_n \phi_n(x) \qquad (28) $$

for any $a_n \in R$, and define the scalar product in our space to be:

$$ \left\langle \sum_{n=0}^{\infty} a_n \phi_n(x), \sum_{n=0}^{\infty} d_n \phi_n(x) \right\rangle_{\mathcal{H}} \equiv \sum_{n=0}^{\infty} \frac{a_n d_n}{\lambda_n}. \qquad (29) $$

Assuming that all the evaluation functionals are bounded, it is now easy to check that such a Hilbert space is a RKHS with reproducing kernel given by $K(x, y)$. In fact we have:

$$ \langle f(y), K(y, x) \rangle_{\mathcal{H}} = \sum_{n=0}^{\infty} \frac{a_n \lambda_n \phi_n(x)}{\lambda_n} = \sum_{n=0}^{\infty} a_n \phi_n(x) = f(x), \qquad (30) $$

hence equation (26) is satisfied.

Notice that when we have a finite number of $\phi_n$, the $\lambda_n$ can be arbitrary (finite) numbers, since convergence is ensured. In particular they can all be equal to one.

Generally, it is easy to show [102] that whenever a function K of the form (27) is available, it is possible to construct a RKHS as shown above. Vice versa, for any RKHS there is a unique kernel K and corresponding $\lambda_n$, $\phi_n$, that satisfy equation (27) and for which equations (28), (29) and (30) hold for all functions in the RKHS. Moreover, equation (29) shows that the norm of the RKHS has the form:

$$ \|f\|_K^2 = \sum_{n=0}^{\infty} \frac{a_n^2}{\lambda_n}. \qquad (31) $$

The $\phi_n$ form a basis for the RKHS (not necessarily orthonormal), and the kernel K is the "correlation" matrix associated with these basis functions. It is in fact well known that there is a close relation between Gaussian processes and RKHS [58, 40, 72]. Wahba [102] discusses in depth the relation between regularization, RKHS and correlation functions of Gaussian processes. The choice of the $\phi_n$ defines a space of functions: the functions that are spanned by the $\phi_n$.

We also call the space $\{(\phi_n(x))_{n=1}^{\infty}, \; x \in X\}$ the feature space induced by the kernel K. The choice of the $\phi_n$ defines the feature space where the data x are "mapped". In this paper we refer to the dimensionality of the feature space as the dimensionality of the RKHS. This is clearly equal to the number of basis elements $\phi_n$, which does not necessarily have to be infinite. For example, with K a Gaussian, the dimensionality of the RKHS is infinite ($\phi_n(x)$ are the Fourier components $e^{i n \cdot x}$), while when K is a polynomial of degree k ($K(x, y) = (1 + x \cdot y)^k$, see section 4), the dimensionality of the RKHS is finite, and all the infinite sums above are replaced with finite sums.

¹⁷ We remind the reader that positive definite operators in $L_2$ are self-adjoint operators such that $\langle Kf, f \rangle \ge 0$ for all $f \in L_2$.

It is well known that expressions of the form (27) actually abound. In fact, it follows from Mercer's theorem [45] that any function $K(x, y)$ which is the kernel of a positive operator¹⁷ in $L_2(\Omega)$ has an expansion of the form (27), in which the $\phi_i$ and the $\lambda_i$ are respectively the orthogonal eigenfunctions and the positive eigenvalues of the operator corresponding to K. In

[91] it is reported that the positivity of the operator associated to K is equivalent to the statement that the kernel K is positive definite, that is the matrix $K_{ij} = K(x_i, x_j)$ is positive definite for all choices of distinct points $x_i \in X$. Notice that a kernel K could have an expansion of the form (27) in which the $\phi_n$ are not necessarily its eigenfunctions. The only requirement is that the $\phi_n$ are linearly independent but not necessarily orthogonal.

In the case that the space X has finite cardinality, the "functions" f are evaluated only at a finite number of points x. If M is the cardinality of X, then the RKHS becomes an M-dimensional space where the functions f are basically M-dimensional vectors, the kernel K becomes an M × M matrix, and the condition that makes it a valid kernel is that it is a symmetric positive definite matrix (semi-definite if M is larger than the dimensionality of the RKHS). Positive definite matrices are known to be the ones which define dot products, i.e. $fKf^T \ge 0$ for every f in the RKHS. The space consists of all M-dimensional vectors f with finite norm $fKf^T$.

Summarizing, RKHS are Hilbert spaces where the dot product is defined using a function $K(x, y)$ which needs to be positive definite just like in the case that X has finite cardinality. The elements of the RKHS are all functions f that have a finite norm given by equation (31). Notice the equivalence of a) choosing a specific RKHS $\mathcal{H}$, b) choosing a set of $\phi_n$ and $\lambda_n$, c) choosing a reproducing kernel K. The last one is the most natural for most applications. A simple example of a RKHS is presented in Appendix B.

Finally, it is useful to notice that the solutions of the methods we discuss in this paper can be written both in the form (2), and in the form (28). Often in the literature formulation (2) is called the dual form of f, while (28) is called the primal form of f.

4 Regularization Networks

In this section we consider the approximation scheme that arises from the minimization of the quadratic functional

$$ \min_{f \in \mathcal{H}} H[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} (y_i - f(x_i))^2 + \lambda \|f\|_K^2 \qquad (32) $$

for a fixed λ. Formulations like equation (32) are a special form of regularization theory developed by Tikhonov, Ivanov [92, 46] and others to solve ill-posed problems and in particular to solve the problem of approximating the functional relation between x and y given a finite number of examples $D = \{x_i, y_i\}_{i=1}^{\ell}$. As we mentioned in the previous sections our motivation in this paper is to use this formulation as an approximate implementation of Vapnik's SRM principle.

In classical regularization the data term is an $L_2$ loss function for the empirical risk, whereas the second term, called stabilizer, is usually written as a functional Ω(f) with certain properties [92, 69, 39]. Here we consider a special class of stabilizers, that is the norm $\|f\|_K^2$ in a RKHS induced by a symmetric, positive definite function $K(x, y)$. This choice allows us to develop a framework of regularization which includes most of the usual regularization schemes. The only significant omission in this treatment, which we make here for simplicity, is the restriction on K to be symmetric positive definite so that the stabilizer is a norm. However, the theory can be extended without problems to the case in which K is positive semidefinite, in which case the stabilizer is a semi-norm [102, 56, 31, 33]. This approach was also sketched in [90].

The stabilizer in equation (32) effectively constrains f to be in the RKHS defined by K. It is possible to show (see for example [69, 39]) that the function that minimizes the functional (32)

has the form:

$$ f(x) = \sum_{i=1}^{\ell} c_i K(x, x_i), \qquad (33) $$

where the coefficients $c_i$ depend on the data and satisfy the following linear system of equations:

$$ (K + \lambda \ell I) c = y \qquad (34) $$

where I is the identity matrix, and we have defined $(y)_i = y_i$, $(c)_i = c_i$, $(K)_{ij} = K(x_i, x_j)$.

It is remarkable that the solution of the more general case of

$$ \min_{f \in \mathcal{H}} H[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i - f(x_i)) + \lambda \|f\|_K^2, \qquad (35) $$

where the function V is any differentiable function, is quite similar: the solution has exactly the same general form of (33), though the coefficients cannot be found anymore by solving a linear system of equations as in equation (34) [37, 40, 90]. For a proof see Appendix C.

The approximation scheme of equation (33) has a simple interpretation in terms of a network with one layer of hidden units [71, 39]. Using different kernels we get various RN's. A short list of examples is given in Table 1.

Kernel Function | Regularization Network
$K(x - y) = \exp(-\|x - y\|^2)$ | Gaussian RBF
$K(x - y) = (\|x - y\|^2 + c^2)^{-1/2}$ | Inverse Multiquadric
$K(x - y) = (\|x - y\|^2 + c^2)^{1/2}$ | Multiquadric
$K(x - y) = \|x - y\|^{2n+1}$ | Thin plate splines
$K(x - y) = \|x - y\|^{2n} \ln \|x - y\|$ | Thin plate splines
$K(x, y) = \tanh(x \cdot y - \theta)$ (only for some values of $\theta$) | Multi Layer Perceptron
$K(x, y) = (1 + x \cdot y)^d$ | Polynomial of degree d
$K(x, y) = B_{2n+1}(x - y)$ | B-splines
$K(x, y) = \frac{\sin(d + 1/2)(x - y)}{\sin\frac{x - y}{2}}$ | Trigonometric polynomial of degree d

Table 1: Some possible kernel functions. The first four are radial kernels. The multiquadric and thin plate splines are positive semidefinite and thus require an extension of the simple RKHS theory of this paper. The last three kernels were proposed by Vapnik [96], originally for SVM. The last two kernels are one-dimensional: multidimensional kernels can be built by tensor products of one-dimensional ones. The functions $B_n$ are piecewise polynomials of degree n, whose exact definition can be found in [85].

When the kernel K is positive semidefinite, there is a subspace of functions f which have norm $\|f\|_K^2$ equal to zero. They form the null space of the functional $\|f\|_K^2$ and in this case the minimizer of (32) has the form [102]:

$$ f(x) = \sum_{i=1}^{\ell} c_i K(x, x_i) + \sum_{\alpha=1}^{k} b_\alpha \psi_\alpha(x), \qquad (36) $$

where $\{\psi_\alpha\}_{\alpha=1}^{k}$ is a basis in the null space of the stabilizer, which in most cases is a set of polynomials, and therefore will be referred to as the "polynomial term" in equation (36). The coefficients $b_\alpha$ and $c_i$ depend on the data. For the classical regularization case of equation (32), the coefficients of equation (36) satisfy the following linear system:

$$ (K + \lambda \ell I) c + \Psi^T b = y, \qquad (37) $$

$$ \Psi c = 0, \qquad (38) $$

where I is the identity matrix, and we have defined $(y)_i = y_i$, $(c)_i = c_i$, $(b)_i = b_i$, $(K)_{ij} = K(x_i, x_j)$, $(\Psi)_{\alpha i} = \psi_\alpha(x_i)$.

When the kernel is positive definite, as in the case of the Gaussian, the null space of the stabilizer is empty. However, it is often convenient to redefine the kernel and the norm induced by it so that the induced RKHS contains only zero-mean functions, that is functions $f_1(x)$ s.t. $\int_X f_1(x)\, dx = 0$. In the case of a radial kernel K, for instance, this amounts to considering a new kernel

$$ K'(x, y) = K(x, y) - \lambda_0 $$

without the zeroth order Fourier component, and a norm

$$ \|f\|_{K'}^2 = \sum_{n=1}^{\infty} \frac{a_n^2}{\lambda_n}. \qquad (39) $$

The null space induced by the new K' is the space of constant functions. Then the minimizer of the corresponding functional (32) has the form:

$$ f(x) = \sum_{i=1}^{\ell} c_i K'(x, x_i) + b, \qquad (40) $$

with the coefficients satisfying equations (37) and (38), that respectively become:

$$ (K' + \lambda \ell I) c + \mathbf{1} b = (K - \lambda_0 I + \lambda \ell I) c + \mathbf{1} b = (K + (\lambda \ell - \lambda_0) I) c + \mathbf{1} b = y, \qquad (41) $$

$$ \sum_{i=1}^{\ell} c_i = 0. \qquad (42) $$

Equations (40) and (42) imply that the minimizer of (32) is of the form:

$$ f(x) = \sum_{i=1}^{\ell} c_i K'(x, x_i) + b = \sum_{i=1}^{\ell} c_i \left( K(x, x_i) - \lambda_0 \right) + b = \sum_{i=1}^{\ell} c_i K(x, x_i) + b. \qquad (43) $$

Thus we can effectively use a positive definite K and the constant b, since the only change in equation (41) just amounts to the use of a different λ. Choosing to use a non-zero b effectively

means choosing a different feature space and a different stabilizer from the usual case of equation (32): the constant feature is not considered in the RKHS norm and therefore is not penalized. This choice is often quite reasonable, since in many regression and, especially, classification problems, shifts by a constant in f should not be penalized.

In summary, the argument of this section shows that using a RN of the form (43) (for a certain class of kernels K) is equivalent to minimizing functionals such as (32) or (35). The choice of K is equivalent to the choice of a corresponding RKHS and leads to various classical learning techniques such as RBF networks. We discuss connections between regularization and other techniques in sections 4.2 and 4.3. Notice that in the framework we use here the kernels K are not required to be radial or even shift-invariant. Regularization techniques used to solve supervised learning problems [69, 39] were typically used with shift invariant stabilizers (tensor product and additive stabilizers are exceptions, see [39]). We now turn to such kernels.

4.1 Radial Basis Functions

Let us consider a special case of the kernel K of the RKHS, which is the standard case in several papers and books on regularization [102, 70, 39]: the case in which K is shift invariant, that is $K(x, y) = K(x - y)$, and the even more special case of a radial kernel $K(x, y) = K(\|x - y\|)$. Section 3 implies that a radial positive definite K defines a RKHS in which the "features" $\phi_n$ are Fourier components, that is

$$ K(x, y) \equiv \sum_{n=0}^{\infty} \lambda_n \phi_n(x) \phi_n(y) \equiv \sum_{n=0}^{\infty} \lambda_n e^{i 2\pi n \cdot x} e^{-i 2\pi n \cdot y}. \qquad (44) $$

Thus any positive definite radial kernel defines a RKHS over [0, 1] with a scalar product of the form:

$$ \langle f, g \rangle_{\mathcal{H}} \equiv \sum_{n=0}^{\infty} \frac{\tilde{f}(n) \tilde{g}^*(n)}{\lambda_n}, \qquad (45) $$

where $\tilde{f}$ is the Fourier transform of f. The RKHS becomes simply the subspace of $L_2([0, 1]^d)$ of the functions such that

$$ \|f\|_K^2 = \sum_{n=1}^{\infty} \frac{|\tilde{f}(n)|^2}{\lambda_n} < +\infty. \qquad (46) $$

Functionals of the form (46) are known to be smoothness functionals. In fact, the rate of decrease to zero of the Fourier transform of the kernel will control the smoothness property of the function in the RKHS. For radial kernels the minimizer of equation (32) becomes:

$$ f(x) = \sum_{i=1}^{\ell} c_i K(\|x - x_i\|) + b \qquad (47) $$

and the corresponding RN is a Radial Basis Function Network. Thus Radial Basis Function networks are a special case of RN [69, 39].

In fact all translation-invariant stabilizers $K(x, x_i) = K(x - x_i)$ correspond to RKHS's where the basis functions $\phi_n$ are Fourier eigenfunctions and only differ in the spectrum of the eigenvalues (for a Gaussian stabilizer the spectrum is Gaussian, that is $\lambda_n = A e^{-n^2/2}$ (for σ = 1)). For

example, if $\lambda_n = 0$ for all $n > n_0$, the corresponding RKHS consists of all bandlimited functions, that is functions with zero Fourier components at frequencies higher than $n_0$ ¹⁸. Generally $\lambda_n$ are such that they decrease as n increases, therefore restricting the class of functions to be functions with decreasing high frequency Fourier components.

In classical regularization with translation invariant stabilizers and associated kernels, the common experience, often reported in the literature, is that the form of the kernel does not matter much. We conjecture that this may be because all translation invariant K induce the same type of $\phi_n$ features: the Fourier basis functions.

4.2 Regularization, generalized splines and kernel smoothers

A number of approximation and learning techniques can be studied in the framework of regularization theory and RKHS. For instance, starting from a reproducing kernel it is easy [5] to construct kernels that correspond to tensor products of the original RKHS; it is also easy to construct the additive sum of several RKHS in terms of a reproducing kernel.

Tensor Product Splines: In the particular case that the kernel is of the form:

$$ K(x, y) = \prod_{j=1}^{d} k(x^j, y^j) $$

where $x^j$ is the jth coordinate of vector x and k is a positive definite function with one-dimensional input vectors, the solution of the regularization problem becomes:

$$ f(x) = \sum_{i} c_i \prod_{j=1}^{d} k(x_i^j, x^j) $$

Therefore we can get tensor product splines by choosing kernels of the form above [5].

Additive Splines: In the particular case that the kernel is of the form:

$$ K(x, y) = \sum_{j=1}^{d} k(x^j, y^j) $$

where $x^j$ is the jth coordinate of vector x and k is a positive definite function with one-dimensional input vectors, the solution of the regularization problem becomes:

$$ f(x) = \sum_{i} c_i \left( \sum_{j=1}^{d} k(x_i^j, x^j) \right) = \sum_{j=1}^{d} \left( \sum_{i} c_i k(x_i^j, x^j) \right) = \sum_{j=1}^{d} f_j(x^j) $$

So in this particular case we get the class of additive approximation schemes of the form:

$$ f(x) = \sum_{j=1}^{d} f_j(x^j) $$

A more extensive discussion on relations between known approximation methods and regularization can be found in [39].

¹⁸ The simplest K is then $K(x, y) = \mathrm{sinc}(x - y)$, or kernels that are convolutions with it.
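As a concrete companion to equations (32)-(34), (47) and the kernels of section 4.2, the sketch below fits a Regularization Network with a Gaussian RBF kernel by solving the linear system for the coefficients c, and also assembles tensor-product and additive kernels from a one-dimensional kernel. Writing the system as $(K + \lambda\ell I)c = y$ follows from the $1/\ell$ factor in (32); the bandwidth, the value of λ and the synthetic data are our own choices for the illustration, not prescriptions of the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

def gaussian_kernel(X1, X2, sigma=0.3):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), a radial kernel as in Table 1
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rn_fit(X, y, lam, kernel):
    # coefficients of f(x) = sum_i c_i K(x, x_i), eq. (33), from (K + lam * ell * I) c = y, eq. (34)
    ell = len(y)
    K = kernel(X, X)
    return np.linalg.solve(K + lam * ell * np.eye(ell), y)

def rn_predict(Xnew, X, c, kernel):
    # evaluate the dual form f(x) = sum_i c_i K(x, x_i)
    return kernel(Xnew, X) @ c

# one-dimensional regression example (a Radial Basis Function network, eq. (47) without bias)
X = rng.uniform(-1.0, 1.0, (40, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(40)
c = rn_fit(X, y, lam=1e-3, kernel=gaussian_kernel)
Xtest = np.linspace(-1.0, 1.0, 5)[:, None]
print(rn_predict(Xtest, X, c, gaussian_kernel))

# kernels built from a one-dimensional kernel k, as in section 4.2
def k1(u, v, sigma=0.3):
    return np.exp(-(u[:, None] - v[None, :]) ** 2 / (2.0 * sigma ** 2))

def tensor_product_kernel(X1, X2):
    # K(x, y) = prod_j k(x^j, y^j)
    K = np.ones((len(X1), len(X2)))
    for j in range(X1.shape[1]):
        K *= k1(X1[:, j], X2[:, j])
    return K

def additive_kernel(X1, X2):
    # K(x, y) = sum_j k(x^j, y^j), giving additive approximation schemes f(x) = sum_j f_j(x^j)
    K = np.zeros((len(X1), len(X2)))
    for j in range(X1.shape[1]):
        K += k1(X1[:, j], X2[:, j])
    return K

X2d = rng.uniform(-1.0, 1.0, (30, 2))
y2d = X2d[:, 0] ** 2 + np.sin(2.0 * X2d[:, 1]) + 0.05 * rng.standard_normal(30)
c_add = rn_fit(X2d, y2d, lam=1e-3, kernel=additive_kernel)
print(rn_predict(X2d[:3], X2d, c_add, additive_kernel))
```

The same rn_fit routine works with any positive definite kernel (including the tensor-product one), which is the practical face of the statement that choosing K is equivalent to choosing the RKHS and hence the approximation scheme.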