Some Properties of Regularized Kernel Methods


Journal of Machine Learning Research 5 (2004). Submitted 12/03; Revised 7/04; Published 10/04

Some Properties of Regularized Kernel Methods

Ernesto De Vito (DEVITO@UNIMO.IT)
Dipartimento di Matematica, Università di Modena, Modena, Italy
and INFN, Sezione di Genova, Genova, Italy

Lorenzo Rosasco (ROSASCO@DISI.UNIGE.IT)
Andrea Caponnetto (CAPONNETTO@DISI.UNIGE.IT)
DISI, Università di Genova, Genova, Italy

Michele Piana (PIANA@DIMA.UNIGE.IT)
DIMA, Università di Genova, Genova, Italy

Alessandro Verri (VERRI@DISI.UNIGE.IT)
DISI, Università di Genova, Genova, Italy

Editor: Alexander J. Smola

Abstract

In regularized kernel methods, the solution of a learning problem is found by minimizing functionals consisting of the sum of a data and a complexity term. In this paper we investigate some properties of a more general form of the above functionals in which the data term corresponds to the expected risk. First, we prove a quantitative version of the representer theorem holding for both regression and classification, for both differentiable and non-differentiable loss functions, and for arbitrary offset terms. Second, we show that the case in which the offset space is non trivial corresponds to solving a standard problem of regularization in a Reproducing Kernel Hilbert Space in which the penalty term is given by a seminorm. Finally, we discuss the issues of existence and uniqueness of the solution. From the specialization of our analysis to the discrete setting it is immediate to establish a connection between the solution properties of sparsity and coefficient boundedness and some properties of the loss function. For the case of Support Vector Machines for classification, we also obtain a complete characterization of the whole method in terms of the Kuhn-Tucker conditions with no need to introduce the dual formulation.

Keywords: statistical learning, reproducing kernel Hilbert spaces, convex analysis, representer theorem, regularization theory

©2004 Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Michele Piana, and Alessandro Verri.

1. Introduction

The problem of learning from examples can be seen as the problem of estimating an unknown functional dependency given only a finite (possibly small) number of instances. The seminal work of Vapnik (1988) shows that the key to effectively solving this problem is controlling the complexity of the solution. In the context of statistical learning this leads to techniques known as regularization networks (Evgeniou et al., 2000) or regularized kernel methods (Vapnik, 1988; Cristianini and Shawe-Taylor, 2000; Schölkopf and Smola, 2002). More precisely, given a training set S = (x_i, y_i)_{i=1}^ℓ of pairs of examples, the estimator is defined as

f_S^λ ∈ argmin_{f ∈ H} { (1/ℓ) Σ_{i=1}^ℓ V(y_i, f(x_i)) + λ‖f‖_H^2 },   (1)

where V is the loss function, H is the Hilbert space of the hypotheses and λ > 0 is the regularization parameter. As shown by Evgeniou et al. (2000), the above minimization problem can also be seen as a particular instance of Tikhonov regularization (Tikhonov and Arsenin, 1977; Mukherjee et al., 2002) for a multivariate function approximation problem, which is well known to be ill-posed (Bertero et al., 1988; Evgeniou et al., 2000; Poggio and Smale, 2003).

In this paper we study the generalization of the above problem to the continuous setting, that is, given a probability distribution ρ defined on X × Y, where X is the input space and Y is the output space, we study the properties of the estimator

(f^λ, g^λ) ∈ argmin_{(f,g) ∈ H×B} { ∫_Z V(y, f(x)+g(x)) dρ(x,y) + λ‖f‖_H^2 },   (2)

where H and B are reproducing kernel Hilbert spaces (RKHS): H is the space of penalized functions and B is the offset space (Wahba, 1990). Considering the continuous setting is meaningful for several reasons. First, it is useful in order to study the generalization properties of kernel methods (Steinwart, 2002). To this purpose, one associates with each function f : X → R its expected risk,

I[f] = ∫_Z V(y, f(x)) dρ(x,y),

where ρ is the unknown probability distribution describing the relation between the input x ∈ X and the output y ∈ Y. Following Cucker and Smale (2002), for regularized kernel methods the discrepancy between the expected risk of the estimator, f_S^λ, and the minimum obtainable risk, inf_{f∈H} I[f], can be decomposed as

I[f_S^λ] − inf_{f∈H} I[f] = ( I[f_S^λ] − I[f^λ] ) + ( I[f^λ] − inf_{f∈H} I[f] ),

where the first term represents the sample error and the second term the approximation error (Niyogi and Girosi, 1999). Clearly, insight on the form of f^λ can be useful to obtain better bounds on both errors. Second, considering the continuous measure ρ corresponds intuitively to finding a stable solution to the learning problem in the case of an infinite number of examples and, hence, gives information about the best we can do in the hypothesis space H × B (Mukherjee et al., 2002).

Third, we can treat both the empirical measure and the ideal unknown probability distribution in a unified framework.

The contribution of our work is threefold. First, we provide a complete characterization of the explicit form of the estimator (f^λ, g^λ) given by Eq. (2) by exploiting a convexity assumption on the loss functions. Our result can be interpreted as a quantitative version of the representer theorem holding for both regression and classification and in which explicit care is taken of the offset space B. Then, we discuss the role of the offset space B. The starting point of our discussion is the obvious observation that the estimator given by Problem (2) is not the pair (f^λ, g^λ) but the sum f^λ + g^λ. In other words, the natural hypothesis space is the sum H + B instead of the product H × B (which is not even a space of functions from X to R). For an arbitrary loss function we prove that Problem (2) is equivalent to a kernel method defined on H + B, which is a RKHS, with a penalty term given by a seminorm. Finally, for the sake of completeness, we study the issues of existence and uniqueness for Problem (2). When B is not the empty set, both issues are not trivial. In particular, for B equal to the set of constants, we prove existence under very reasonable conditions: for example, for classification, one needs at least two examples with different labels. About uniqueness we show that, for strictly convex loss functions, one has uniqueness if and only if the space B is small enough to be separated by the measure ρ: for example, in the discrete setting, this last condition means that a function g ∈ B is equal to 0 if and only if g(x_i) = 0 for all i. For the hinge loss function, which is convex but not strictly convex, we give an ad hoc condition in terms of the number of support vectors of the two classes.

The plan of the paper is as follows. In Section 2 we discuss our contributions with respect to previous works. In Section 3 we introduce some basic concepts of learning theory and state the assumptions we make on the loss function V and the hypothesis spaces H and B. In Section 4 we study the form of the solution of Problem (2). In Section 5 we discuss the theoretical meaning of the offset space B. We discuss the problem of existence and uniqueness in Section 6. In Section 7 we apply our results to the discrete setting and focus on the case of Support Vector Machines. In the appendix we recall some notions from convex analysis in infinite dimensional spaces.

2. Putting Our Work in Context

We now briefly discuss the relation between our results and previous works on this subject. Results about the form of the solution of kernel methods are known in the literature as representer theorems (if B is not trivial they are called semiparametric representer theorems). The first result in this direction is due to Kimeldorf and Wahba (1970) for the squared loss function (see also Wahba, 1990). However, the structure of the proof holds for arbitrary loss functions, as shown by many authors such as Cox and O'Sullivan (1990). In the framework of statistical learning, Schölkopf et al. (2001) give a proof of the representer theorem that holds for an arbitrary loss function and for any penalty term, provided it is a strictly increasing function of the norm. This kind of result shows that, if H is a RKHS with kernel K, the estimator f_S^λ defined by Eq. (1) can be written as

f_S^λ(x) = Σ_{i=1}^ℓ α_i K(x, x_i).

The above result holds for an arbitrary loss function and for a large class of penalty terms. However, the form of the coefficients α_i is unknown.
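To make the expansion above concrete, here is a minimal NumPy sketch (not part of the original paper) of Problem (1) for the square loss, where the representer coefficients are obtained by solving a linear system; the Gaussian kernel, the synthetic data, and all variable names are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))                    # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)     # noisy training outputs

lam, ell = 0.1, len(X)
K = gaussian_kernel(X, X)

# Square loss: f(x) = sum_i alpha_i K(x, x_i) minimizes
# (1/ell) sum_i (y_i - f(x_i))^2 + lam * ||f||_H^2
# when the coefficients solve (K + lam * ell * I) alpha = y.
alpha = np.linalg.solve(K + lam * ell * np.eye(ell), y)

X_test = np.linspace(-3, 3, 200)[:, None]
f_test = gaussian_kernel(X_test, X) @ alpha             # evaluate the estimator
```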

For the squared loss function, the form of the coefficients is well known in the context of inverse problems, see, for example, Tikhonov and Arsenin (1977), and reduces to solving a linear system of equations. For arbitrary differentiable loss functions, this problem was studied by Poggio and Girosi (1992); Girosi (1998); Wahba (1998), where the coefficients α_i are the solution of a system of algebraic equations. This approach cannot be applied to the hinge and ε-insensitive loss functions (Vapnik, 1988), since they are not differentiable: the form of the coefficients α_i is recovered only through the usual dual Lagrangian formulation of the minimization problem, see, for example, Vapnik (1988); Cristianini and Shawe-Taylor (2000). Recently, Zhang (2001) gives a quantitative representer theorem in the classification setting that holds for differentiable loss functions and Steinwart (2003) extends this result to arbitrary convex loss functions, without using the dual problem. In these papers the form of the coefficients α_i is given in terms of a closed equation involving the subgradient of the loss function. Moreover, they are able to extend the representer theorem to the continuous setting (a study of the solution of Tikhonov regularization in the continuous setting when the square loss is used can be found also in Cucker and Smale, 2002). This paper, using techniques similar to those of Steinwart (2003), extends the above results in the following directions:

- our result holds both for regression and classification;
- we provide a general result that holds also when the offset term is considered. The presence of the offset space forces the coefficients α_i to satisfy a system of linear equations;
- we do not assume that the input space X and the output space Y are compact. In particular, for regression we can assume Y = R;
- we provide a simpler proof than the one of Steinwart (2003) by using known results about integral convex functionals.

A discussion of the role of the offset terms can be found in Evgeniou et al. (2000) and in Poggio et al. (2002) when the space B reduces to the set of constant functions. Their results are close to our Theorem 6, but they are proved assuming that the unit constant is in the Mercer decomposition of the kernel and for the discrete setting, while our result holds true for offset terms living in an arbitrary RKHS.

The problem of existence and uniqueness is discussed in Wahba (1998) for the discrete setting and with differentiable loss functions. For arbitrary ρ the papers by Steinwart (2002, 2003) study existence for the classification setting with the offset space reduced to the constant functions. For the hinge loss and ε-insensitive loss, the problem of uniqueness is treated in Burges and Crisp (2000, 2003). Their proof is based on the dual problem and on the Kuhn-Tucker conditions. Our results subsume the cited results as special cases, but are all obtained in the more general continuous setting. In particular, our results on the uniqueness of the SVM solution are similar to those in Burges and Crisp (2000, 2003) but do not make use of the dual formulation.

3. Notation and Assumptions

In this section we first fix the notation and then state and comment upon the basic assumptions needed to derive the results described in the rest of the paper. We start with the input and output spaces.

3.1 Input and Output Spaces

As usual, we denote by X and Y the input and output spaces respectively. We assume that X is a locally compact second countable space (this assumption is satisfied for instance if X is a closed subset of R^d) and Y is a closed subset of R. We let Z = X × Y and endow it with a probability distribution ρ defined on the Borel σ-algebra of Z. We recall that, since ρ is a bounded measure and Z is second countable, ρ is a Radon measure. In practice, ρ will be either the unknown distribution describing the relation between x and y or the empirical measure

ρ_S = (1/ℓ) Σ_{i=1}^ℓ δ_{(x_i, y_i)},

associated with the training set S = {(x_i, y_i)}_{i=1}^ℓ drawn i.i.d. with respect to ρ. We now deal with loss functions.

3.2 Loss Functions

We collect the mathematical assumptions on the loss function in the following definition and then comment on the purpose of each assumption.

Definition 1 Given p ∈ [1, +∞[, a function V : Y × R → [0, +∞[ such that

1. for all y ∈ Y the function V(y, ·) is convex on R;
2. the function V is measurable on Y × R;
3. there are b ∈ [0, +∞[ and a : Y → [0, +∞[ such that

   V(y, w) ≤ a(y) + b|w|^p   ∀w ∈ R, y ∈ Y,   (3)
   ∫_Z a(y) dρ(x, y) < +∞,   (4)

is called a p-loss function with respect to ρ. If the context is clear, V is simply called a loss function.

The convexity hypothesis is not restrictive, being satisfied by all the loss functions commonly in use. Moreover, it is powerful from a technical point of view: it allows for the use of subgradient techniques without assuming differentiability of V and makes it possible to use convex analysis tools in the study of existence and uniqueness of functional minimizers. Finally, this requirement ensures stronger bounds for the sample error (Bartlett et al., 2002; Bartlett, 2003; Bartlett et al., 2003). Assumption 2 is a minimal requirement for defining the expected risk and it is usually satisfied, since loss functions commonly in use are continuous on Z. Condition 3 is a technical hypothesis we need in order to use results from convex integral functional analysis. For example, it is satisfied in the following cases:

1. for p = 2, if V is the square loss function, V(y, w) = (y − w)^2, and ∫_Z y^2 dρ(x, y) < +∞;

2. for p = 1, if V(y, ·) is Lipschitz on R with a Lipschitz constant independent of y and ∫_Z V(y, 0) dρ(x, y) < +∞.

We now restrict our analysis to some functionals studied in statistical learning.

3.3 Learning Functionals

The expected risk of a measurable function f : X → R is defined as

I[f] = ∫_Z V(y, f(x)) dρ(x, y),

and can be seen as the average error incurred by the function f, where f is a possible solution of the learning problem and the probability measure ρ is unknown. Given a training set S, a possible way to estimate I[f] is to evaluate the empirical risk

I_emp^S[f] = (1/ℓ) Σ_{i=1}^ℓ V(y_i, f(x_i)).

The problem of learning is to find, given the training set S, an estimator f effectively predicting the label of a new point. This translates into finding a function f such that its expected risk is small with high probability. A possible way to efficiently solve the learning problem is provided by regularized kernel methods, which amount to solving a problem of functional minimization as in Problem (1). A generalization of Problem (1) to the continuous setting is provided by Problem (2), in which the continuous measure ρ replaces the empirical measure ρ_S in the first term. In what follows we will refer to the functionals to be minimized in both Eq. (1) and Eq. (2) as Tikhonov functionals and to the solutions as the regularized solutions. The second term of a Tikhonov functional is a smoothness or complexity term measuring the norm of the function f in a suitable Hilbert space H. The minimization takes place in the hypothesis space H × B. We now collect the assumptions on the hypothesis space at the basis of our analysis.

3.4 Hypothesis Space

First of all, we recall the definition of a reproducing kernel Hilbert space. A RKHS H on X with kernel K : X × X → R is defined as the unique Hilbert space of real valued functions on X such that, for all f ∈ H,

f(x) = ⟨f, K_x⟩_H   ∀x ∈ X,   (5)

where K_x is the function on X defined by K_x(s) = K(x, s). Given a probability measure ρ on Z and p ∈ [1, +∞[, we say that the kernel K is p-bounded with respect to ρ if the function K is measurable on X × X and

∫_Z K(x, x)^{p/2} dρ(x, y) < +∞.   (6)

Clearly the above condition depends only on the marginal distribution of ρ on X and ensures that H is a subspace of L^p(Z, ρ) with continuous inclusion (see Lemma 4 in Section 4).
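The following short sketch (an illustration added here, not part of the paper) checks numerically the reproducing property (5) and the pointwise bound |f(x)| ≤ ‖f‖_H √K(x,x) that drives the continuous inclusion into L^p; the Gaussian kernel and the expansion points are assumptions chosen only for the example.

```python
import numpy as np

def k(x, s, sigma=1.0):
    # Gaussian kernel; K_x(s) = k(x, s)
    x = np.asarray(x, dtype=float); s = np.asarray(s, dtype=float)
    return np.exp(-(x - s) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
centers = rng.uniform(-1, 1, size=6)          # expansion points x_j
c = rng.standard_normal(6)                    # f = sum_j c_j K_{x_j} belongs to H

G = k(centers[:, None], centers[None, :])     # Gram matrix K(x_i, x_j)
f_norm = np.sqrt(c @ G @ c)                   # ||f||_H, since ||f||_H^2 = c^T G c

# Reproducing property (5): f(x) = <f, K_x>_H = sum_j c_j K(x_j, x), hence
# |f(x)| <= ||f||_H * sqrt(K(x, x)) by Cauchy-Schwarz (used in Lemma 4 below).
xs = np.linspace(-2, 2, 400)
f_vals = k(xs[:, None], centers[None, :]) @ c
assert np.all(np.abs(f_vals) <= f_norm * np.sqrt(k(xs, xs)) + 1e-9)
```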

This fact is essential for proving our results. In particular, the p-boundedness of the kernel is fulfilled for all p ∈ [1, +∞[ if X is compact and the kernel is continuous, or if the kernel is measurable and bounded.

We are now ready to discuss the assumptions on the hypothesis space. We fix the probability measure ρ on Z and p ∈ [1, +∞[ such that V is a p-loss function with respect to ρ. We require that the space of penalized functions H and the space of offset functions B are RKHS on X such that the corresponding kernels K and K_B are p-bounded with respect to ρ. We denote the corresponding norms by ‖·‖_H and ‖·‖_B. Finally, we notice that, in general, the product space H × B is not a RKHS.

In learning theory usually X is compact, K is continuous and B is the one dimensional vector space of constant functions, B = {f : X → R | f(x) = b, b ∈ R} ≅ R, with kernel K_B simply given by K_B(x, s) = 1. Another example of offset space, which arises in approximation problems in RKHS on a bounded interval, is the space of splines of order n, whose corresponding kernel is continuous (Wahba, 1990). In both cases the p-boundedness assumption is satisfied for all p. Our framework allows us to treat arbitrary (possibly infinite-dimensional) offset spaces, with the possibility of incorporating jumps in the offset term.

Finally, the requirement that the hypothesis space is a RKHS is due to the fact that the minimization of a convex functional in a Hilbert space is easier to treat than in an arbitrary Banach space, since in the former case the subgradient of the functional is an element of the space itself. Moreover, in the proofs we use extensively the reproducing property given by Eq. (5).

4. Explicit Form of the Regularized Solution

In this section we determine the explicit form of the minimizer of the Tikhonov functional introduced in the previous section. We first state the main theorem and comment on the obtained result, then we provide the mathematical proof.

4.1 Main Theorem

Theorem 2 Let ρ be a probability measure on X × Y, where X is a locally compact second countable space and Y is a closed subset of R. Let V be a p-loss function with respect to ρ, p ∈ [1, +∞[. Let H and B be reproducing kernel Hilbert spaces such that the corresponding kernels K and K_B are p-bounded with respect to ρ. Define q ∈ ]1, +∞] such that 1/q + 1/p = 1. Let λ > 0 and (f^λ, g^λ) ∈ H × B; then

(f^λ, g^λ) ∈ argmin_{(f,g) ∈ H×B} { ∫_Z V(y, f(x)+g(x)) dρ(x,y) + λ‖f‖_H^2 }   (7)

if and only if there is α ∈ L^q(Z, ρ) satisfying

α(x, y) ∈ (∂V)(y, f^λ(x) + g^λ(x))   for ρ-almost all (x, y) ∈ X × Y,   (8)

f^λ(s) = −(1/2λ) ∫_Z K(s, x) α(x, y) dρ(x, y)   ∀s ∈ X,   (9)

0 = ∫_Z K_B(s, x) α(x, y) dρ(x, y)   ∀s ∈ X.   (10)

The proof of this theorem is given in the following subsection. A few important remarks are in order.

First, the theorem gives a general quantitative version of the representer theorem. The generality is obtained by considering the continuous setting, which subsumes the discrete setting if the measure ρ is the empirical measure ρ_S. In this case, the integral reduces to a finite sum and we recover the well known result that f_S^λ = Σ_{i=1}^ℓ α_i K_{x_i}, where the x_i form the training set. Moreover, the solution is quantitatively characterized since the coefficients α are given by Eq. (8) involving the subgradient. For differentiable loss functions in the discrete setting, Eq. (8) reduces to

α_i = V'(y_i, f_S^λ(x_i) + g_S^λ(x_i)),

where V' denotes the derivative with respect to the second variable (Girosi, 1998; Wahba, 1998).

Second, if {ψ_i}_{i=1}^m is a basis for B, the offset part of the solution can be written as g^λ = Σ_{i=1}^m d_i ψ_i, where the coefficients d_i are again constrained by Eq. (8). A discussion on how to solve Eq. (8) explicitly can be found in Wahba (1998). Furthermore, the presence of B induces a system of linear constraints on the coefficients α_i expressed by Eq. (10) that, for B = R, reduces to the well known condition

Σ_{i=1}^ℓ α_i = 0.

We stress that, unlike previous works, the above equation has been derived without introducing the dual formulation.

Finally, we discuss the role of Assumption 3 in Definition 1. From the proof, it is apparent that this assumption is needed to ensure the continuity of the first term in the Tikhonov functional, which in the discrete setting is trivially guaranteed. Therefore, for the discrete setting Theorem 2 holds for any convex loss function. In particular, L^q(Z, ρ_S) = R^ℓ and the condition α ∈ L^q(Z, ρ_S) is always satisfied. Back in the continuous setting, if V(y, ·) is Lipschitz on R with a Lipschitz constant independent of y and ∫_Z V(y, 0) dρ(x, y) < +∞, one can choose p = 1, so that q = +∞ and the condition α ∈ L^∞(Z, ρ) means that α is bounded. For the square loss, clearly p = 2, so that q = 2 and α is square-integrable. As shown by Steinwart (2003), for classification and compact X, one can again remove Assumption 3 of Definition 1 using the fact that a convex function is locally Lipschitz and the range of possible y is bounded.

The following corollary is the restatement of the representer theorem without offset space.

Corollary 3 With the assumptions of Theorem 2, let f^λ ∈ H; then

f^λ ∈ argmin_{f ∈ H} { ∫_Z V(y, f(x)) dρ(x, y) + λ‖f‖_H^2 }

if and only if there is α ∈ L^q(Z, ρ) satisfying

α(x, y) ∈ (∂V)(y, f^λ(x))   for ρ-almost all (x, y) ∈ X × Y,
f^λ(s) = −(1/2λ) ∫_Z K(s, x) α(x, y) dρ(x, y)   ∀s ∈ X.
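As an illustration added here (not part of the paper), the sketch below checks Corollary 3 numerically in the discrete setting with the square loss: the coefficients of the subgradient relation are α_i = 2(f^λ(x_i) − y_i), and Eq. (9) with ρ = ρ_S rebuilds f^λ from them. The kernel ridge closed form and the synthetic data are assumptions for the example only.

```python
import numpy as np

rng = np.random.default_rng(2)
ell, lam = 40, 0.05
X = rng.uniform(-2, 2, size=(ell, 1))
y = np.cos(2 * X[:, 0]) + 0.1 * rng.standard_normal(ell)

K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # Gaussian Gram matrix

# Square loss V(y, w) = (y - w)^2: the regularized solution is kernel ridge,
# with expansion coefficients c solving (K + lam * ell * I) c = y.
c = np.linalg.solve(K + lam * ell * np.eye(ell), y)
f_train = K @ c                                               # f^lambda(x_i)

# Corollary 3 / Eq. (8) with rho = rho_S: alpha_i = V'(y_i, f(x_i)) = 2 (f(x_i) - y_i),
# and Eq. (9) becomes f^lambda = -(1 / (2 * lam * ell)) * sum_i alpha_i K_{x_i}.
alpha = 2 * (f_train - y)
f_reconstructed = -(1.0 / (2 * lam * ell)) * (K @ alpha)
assert np.allclose(f_reconstructed, f_train, atol=1e-8)
```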

4.2 Proof of the Main Theorem

Before giving the proof of the theorem we discuss the proof structure, which aside from some technicalities is very simple, and is based on two lemmas. The Tikhonov functional I[f + g] + λ‖f‖_H^2 is a convex map on H × B, so (f^λ, g^λ) is a minimizer of the Tikhonov functional if and only if (0, 0) is in its subgradient, which is a subset of H × B. Using linearity, the computation of the subgradient of the Tikhonov functional reduces to the computation of the subgradients of I[f + g] and ‖f‖_H^2 respectively. Since the latter functional is differentiable, the subgradient evaluation is straightforward. Some care is needed for the subgradient of the former. First, we rewrite it as an integral functional on L^p(Z, ρ) and then use a fundamental result of convex analysis to interchange the integral and the subgradient.

Proof [of Theorem 2] Clearly, λ‖f‖_H^2 is continuous and, by Lemma 4, the functional I[f + g] is continuous and finite. So, from item 5 of Proposition 14, one has that

∂( I[f + g] + λ‖f‖_H^2 ) = ∂( I[f + g] ) + λ ∂( ‖f‖_H^2 ).

Now, the map (f, g) ↦ ‖f‖_H^2 is differentiable with derivative (2f, 0) and, therefore, by item 1 of Proposition 14,

∂( ‖f‖_H^2 ) = {(2f, 0)}.   (11)

The main difficulty is the evaluation of the subgradient of the map I[f + g], given in Lemma 5. By means of this lemma we obtain that the elements of the subgradient of I[f + g] at (f, g) are of the form

( ∫_Z K(x, ·) α(x, y) dρ(x, y), ∫_Z K_B(x, ·) α(x, y) dρ(x, y) ),   (12)

where α ∈ L^q(Z, ρ) satisfies

α(x, y) ∈ (∂V)(y, f(x) + g(x))   (13)

for ρ-almost all (x, y) ∈ X × Y. Now, by combining Eq. (11) and Eq. (12), we have that the elements of the subgradient of I[f + g] + λ‖f‖_H^2 at the point (f, g) are of the form

( ∫_Z K(x, ·) α(x, y) dρ(x, y) + 2λf, ∫_Z K_B(x, ·) α(x, y) dρ(x, y) ),   (14)

where α ∈ L^q(Z, ρ) satisfies Eq. (13). From item 3 of Proposition 14, we have that an element (f^λ, g^λ) ∈ H × B is a minimizer of I[f + g] + λ‖f‖_H^2 if and only if (0, 0) belongs to the subgradient evaluated at (f^λ, g^λ). Using Eq. (14), one has that

f^λ(s) = −(1/2λ) ∫_Z α(x, y) K(x, s) dρ(x, y),
∫_Z α(x, y) K_B(x, s) dρ(x, y) = 0,

where, by means of Eq. (13), α ∈ L^q(Z, ρ) satisfies Eq. (8). This ends the proof.

Before computing the subgradient of the map I[f + g] in Lemma 5, we need to extend the definition of the expected risk to L^p(Z, ρ). First of all, we let

I_0[u] = ∫_Z V(y, u(x, y)) dρ(x, y),   u ∈ L^p(Z, ρ),

so that I[f + g] = I_0(J(f, g)), where J : H × B → L^p(Z, ρ) is the linear map

J(f, g) = f + g

(the function f + g is viewed in a natural way as a function on Z). The following lemma collects some technical facts on I_0 and J.

Lemma 4 With the above notations,

1. the functional I_0 : L^p(Z, ρ) → [0, +∞[ is well-defined and continuous;
2. the operator J : H × B → L^p(Z, ρ) is well-defined and continuous.

Proof Since the loss function V can be regarded as a function on Z × R, that is, V(z, w) = V(y, w) where z = (x, y), one has that I_0[u] is the Nemitski functional associated with V (see Appendix), that is,

I_0[u] = ∫_Z V(z, u(z)) dρ(z),   u ∈ L^p(Z, ρ).

We claim that I_0[u] is finite. Indeed, given u ∈ L^p(Z, ρ), by Eq. (3),

∫_Z V(y, u(z)) dρ(x, y) ≤ ∫_Z ( a(y) + b|u(z)|^p ) dρ(x, y) < +∞.

The proof that I_0 is continuous can be found in Proposition III.5.1 of Ekeland and Turnbull (1983). In order to prove the second item, we let f ∈ H. Then, by Eq. (5),

∫_Z |f(x)|^p dρ(x, y) = ∫_Z |⟨f, K_x⟩_H|^p dρ(x, y) ≤ ‖f‖_H^p ∫_Z K(x, x)^{p/2} dρ(x, y) = C ‖f‖_H^p < +∞,

where C = ∫_Z K(x, x)^{p/2} dρ(x, y) is finite since K is p-bounded (see Eq. (6)). In particular, the function (x, y) ↦ f(x) is in L^p(Z, ρ) and ‖f‖_{L^p} ≤ C^{1/p} ‖f‖_H. The same relation clearly holds for g ∈ B. It follows that J is well defined and

‖f + g‖_{L^p} ≤ C^{1/p} ‖f‖_H + C^{1/p} ‖g‖_B.

Since J is linear, it follows that J is continuous.

Finally, the following lemma computes the subgradient of I = I_0 ∘ J.

Lemma 5 With the above notations, let (f, g) ∈ H × B; then (φ, ψ) ∈ ∂(I_0 ∘ J)(f, g) if and only if there is α ∈ L^q(Z, ρ) such that

α(x, y) ∈ (∂V)(y, f(x) + g(x))   for ρ-almost all (x, y) ∈ X × Y,
φ(s) = ∫_Z K(s, x) α(x, y) dρ(x, y)   ∀s ∈ X,
ψ(s) = ∫_Z K_B(s, x) α(x, y) dρ(x, y)   ∀s ∈ X.

Proof Since I_0 is finite and continuous in 0 = J(0), by point 6 of Proposition 14, we know that

∂(I_0 ∘ J)(f, g) = J*(∂I_0)(J(f, g)),   (15)

where J* : L^q(Z, ρ) → H × B is the adjoint of J, that is,

⟨J*α, (f, g)⟩_{H×B} = ∫_Z α(x, y) J(f, g)(x, y) dρ(x, y).

First of all, we compute ∂I_0. Since I_0[0] < +∞, we can apply Proposition 15 so that, given u ∈ L^p(Z, ρ), then α ∈ (∂I_0)(u) if and only if α ∈ L^q(Z, ρ) and α(z) ∈ (∂V)(y, u(x, y)) for ρ-almost all (x, y) ∈ X × Y.

We now compute the adjoint of J. Let α ∈ L^q(Z, ρ) and (φ, ψ) = J*α ∈ H × B. Using the reproducing property of H and the definition of J we can write

φ(s) = ⟨φ, K_s⟩_H = ⟨J*α, (K_s, 0)⟩_{H×B} = ⟨α, J(K_s, 0)⟩_{L^2(Z,ρ)}.

Writing the scalar product explicitly we then find

φ(s) = ∫_Z K(s, x) α(x, y) dρ(x, y).

Reasoning in the same way we find that

ψ(s) = ∫_Z K_B(s, x) α(x, y) dρ(x, y).

Replacing the above formulas in Eq. (15), we have the thesis.

5. Dealing with the Offset Space B

In this section we deal with the offset term which often appears in regularized solutions. We first motivate our analysis, then state and discuss our main result on this issue. Finally, we give the proofs of the results.

5.1 Motivations

In the previous section we minimized a Tikhonov functional on the set H × B, dealing explicitly with the possible presence of an offset term in the form of the solution. Typical examples in which offset spaces arise are Support Vector Machine algorithms (Vapnik, 1988), where the offset term is a constant accounting for the translation invariance of the separating hyperplane, and penalization methods (Wahba, 1990), where the offset space is the kernel space of the penalization operator. However, the fact that the set H × B is not a RKHS (in fact, it is not even a function space) makes it cumbersome to extend typical statistical learning results to the general setting in which the offset term is considered. For example, a separate analysis, with and without the offset term, is needed for measuring the complexity of the hypothesis space or studying algorithm consistency. In this section we show that under very weak conditions the presence of an offset term is equivalent to solving a standard regularization problem with a seminorm (Wahba, 1990).

The fact that the estimator is f^λ(x) + g^λ(x) (for regression) or sgn(f^λ(x) + g^λ(x)) (for classification) suggests replacing H × B with the sum

S = H + B = { f + g | f ∈ H, g ∈ B }.

The hypothesis space S is a space of functions on X and, in particular, a RKHS, the kernel being the sum of the kernels of H and B. In this section we show that the minimization of a Tikhonov functional on H × B is essentially equivalent to the minimization of an appropriate functional on S. This provides a rigorous derivation of the following facts.

1. The equivalent functional on S is also a Tikhonov functional. The penalty term is a seminorm penalizing the functions in S orthogonal to B only.
2. The estimator given by the minimization of the Tikhonov functional on S depends only on the kernel sum.

Moreover, since the hypothesis space S is a RKHS, a number of classical results of learning theory follow without further effort. Finally, we notice that the norm of B (hence the kernel K_B) plays no role in the functional I[f + g] + λ‖f‖_H^2, that is, all kernels whose corresponding RKHS is B as a vector space give rise to the same minimizers (f^λ, g^λ). This fact is confirmed by Eq. (18) below (see also Eq. (20)).

5.2 Main Theorem

We recall that the norm in S is given by

‖f + g‖_S^2 = inf_{f'∈H, g'∈B, f+g = f'+g'} ( ‖f'‖_H^2 + ‖g'‖_B^2 )   (16)

and, with respect to this norm, S is a RKHS on X with kernel K + K_B (Schwartz, 1964). We are now ready to state the following result.

Theorem 6 Let Q be the orthogonal projection onto the closed subspace of S

S_0 = { s ∈ S | ⟨s, g⟩_S = 0 ∀g ∈ B },

that is, the subset of functions orthogonal to B w.r.t. the scalar product in S. We have the following facts.

1. If (f^λ, g^λ) ∈ H × B is a solution of the problem

   min_{(f,g) ∈ H×B} { I[f + g] + λ‖f‖_H^2 },

   then s^λ = f^λ + g^λ ∈ S is a solution of the problem

   min_{s ∈ S} { I[s] + λ‖Qs‖_S^2 }

   and f^λ = Qs^λ.

2. If s^λ ∈ S is a solution of the problem

   min_{s ∈ S} { I[s] + λ‖Qs‖_S^2 },

   let f^λ = Qs^λ and g^λ = s^λ − Qs^λ; then

   I[f^λ + g^λ] + λ‖f^λ‖_H^2 = inf_{(f,g) ∈ H×B} { I[f + g] + λ‖f‖_H^2 }.

   In particular, if g^λ ∈ B, then (f^λ, g^λ) ∈ H × B is a minimizer of I[f + g] + λ‖f‖_H^2.

Before giving the proof in the following subsection we comment on this result.

First, notice that if H ∩ B = {0}, then S = H ⊕ B and ‖f + g‖_S^2 = ‖f‖_H^2 + ‖g‖_B^2. In this case the theorem is trivial. However, in the arbitrary case care is needed because there are functions in H not orthogonal to B. Moreover, the norm ‖·‖_S restricted to H and B could be different from ‖·‖_H and ‖·‖_B: in particular, it could happen that (B^⊥)^⊥ ≠ B, where the orthogonality is meant with respect to the dot product in S. This pathology is at the root of the fact that there are cases in which the problem min_{s∈S} { I[s] + λ‖Qs‖_S^2 } has a solution, whereas the functional I[f + g] + λ‖f‖_H^2 does not admit a minimizer on H × B (see the example below). In practice, since H ∩ B in most applications is finite dimensional, this pathology does not occur and the minimization problem on H × B is fully equivalent to the one on S.

Second, the advantage of using the penalty term ‖f‖_H^2 instead of ‖Qs‖_S^2 is that one can solve the minimization problem without knowing the explicit form of the projection Q. Conversely, the space S is the natural space in which to address theoretical issues.
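To illustrate the last point with a concrete computation (a sketch added here, not part of the paper), the code below solves the discrete square-loss problem with an unpenalized constant offset, i.e. the H × B formulation of Theorem 6 with B = R, via a block linear system; it then checks the zero-sum constraint of Eqs. (10)/(20), which is what makes the expansion on K or on the sum kernel K + 1 coincide. The Gaussian kernel, the data, and the block system are assumptions of the example, not the paper's prescription.

```python
import numpy as np

rng = np.random.default_rng(3)
ell, lam = 60, 0.05
X = rng.uniform(-2, 2, size=(ell, 1))
y = 3.0 + np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(ell)   # data with a large offset

K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))       # Gram matrix of K
ones = np.ones(ell)

# Square loss with an unpenalized constant offset b:
# minimize (1/ell) sum_i (y_i - f(x_i) - b)^2 + lam * ||f||_H^2 over f in H, b in R.
# Writing f = sum_i c_i K_{x_i}, one convenient sufficient optimality condition
# is the block linear system assembled below.
A = np.block([[K + lam * ell * np.eye(ell), ones[:, None]],
              [(K @ ones)[None, :],         np.array([[float(ell)]])]])
rhs = np.concatenate([y, [ones @ y]])
sol = np.linalg.solve(A, rhs)
c, b = sol[:ell], sol[ell]

# Consistency with Eq. (10) / Eq. (20): the coefficients sum to zero, so expanding
# f on K or on the sum kernel K + K_B (here K_B = 1) gives the same function.
assert abs(c.sum()) < 1e-8
assert abs((K @ c + b - y).sum()) < 1e-8    # residuals integrate to zero against rho_S
```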

Third, we observe that since the proof does not depend on the convexity of the loss function, the theorem holds for arbitrary (positive) loss functions. However, if V satisfies the hypotheses of Definition 1, from Theorem 2 it follows that the minimizer s^λ of I[s] + λ‖Qs‖_S^2 is of the form

s^λ(s) = −(1/2λ) ∫_Z α(x, y) ( K(x, s) + K_B(x, s) ) dρ(x, y) + g^λ(s)   (17)
       = −(1/2λ) ∫_Z α(x, y) K(x, s) dρ(x, y) + g^λ(s),   (18)

where g^λ ∈ B and α ∈ L^q(Z, ρ) satisfies

α(x, y) ∈ (∂V)(y, s^λ(x)),   (19)
∫_Z α(x, y) K_B(x, s) dρ(x, y) = 0.   (20)

In particular, this implies that, given h ∈ B, one can replace the kernel K with K(x, s) + h(x)h(s) without changing the form of the minimizer s^λ. For example, if B is the set of constant functions, the two kernels K(x, s) = x·s and K(x, s) = x·s + 1 are equivalent since both penalize the functions orthogonal to 1, that is, the space of linear functions.

5.3 Proof

Before giving the proof of Theorem 6 we need to prove the following technical lemma. For this purpose we recall that S_0 was defined as

S_0 = { s ∈ S | ⟨s, g⟩_S = 0 ∀g ∈ B },

and Q was the corresponding orthogonal projection from S onto S_0. Moreover, we let H_0 be the closed subspace of H given by

H_0 = { f ∈ H | ⟨f, h⟩_H = 0 ∀h ∈ H ∩ B }

and P be the corresponding orthogonal projection from H onto H_0. In order to prove the main theorem we need the following technical lemma that characterizes the space S_0.

Lemma 7 Let s = f + g ∈ S with f ∈ H and g ∈ B; then

Qs = Pf,   (21)
‖Qs‖_S = ‖Pf‖_H,   (22)

and there is a sequence (f_n, g_n) ∈ H × B, with f_n + g_n = s, such that

lim_{n→∞} ‖Pf − f_n‖_H = 0.   (23)

Equations (21) and (22) show that S_0 and H_0 are the same Hilbert space and, in particular, Qs ∈ H. However, in general, it could happen that s − Qs ∉ B. Equation (23) is a technical trick to overcome this pathology.

Proof [of Lemma 7] To give the proof of the lemma we need some preliminary facts. Let K be the closed subspace of H × B

K = { (f, g) ∈ H × B | ⟨f, h⟩_H = ⟨g, h⟩_B ∀h ∈ H ∩ B }.

It is known (Schwartz, 1964) that, given s ∈ S, there is a unique (f, g) ∈ K such that s = f + g. Moreover, for all (f', g') ∈ H × B,

⟨s, f' + g'⟩_S = ⟨f, f'⟩_H + ⟨g, g'⟩_B.   (24)

From Eq. (16) one has that

‖f‖_S ≤ ‖f‖_H   ∀f ∈ H.   (25)

First of all we claim that H_0 ⊂ S_0. Clearly, if f ∈ H_0, then (f, 0) ∈ K and, by Eq. (24), for all g ∈ B,

⟨f + 0, 0 + g⟩_S = ⟨f, 0⟩_H + ⟨0, g⟩_B = 0,

that is, f ∈ S_0. This shows the claim. Moreover,

‖f‖_S^2 = ⟨f + 0, f + 0⟩_S = ⟨f, f⟩_H = ‖f‖_H^2.   (26)

Let s = f + g with f ∈ H and g ∈ B. Clearly, f = Pf + h, where h belongs to H_0^⊥ = ((H ∩ B)^⊥)^⊥, the closure of H ∩ B in H (here ⊥ denotes the orthogonal complement with respect to the scalar product of H). It follows that there is a sequence h_n ∈ H ∩ B such that

lim_{n→∞} ‖h − h_n‖_H = 0.   (27)

Since, by Eq. (25), ‖h − h_n‖_S ≤ ‖h − h_n‖_H and Q is continuous, it follows that Qh = lim_n Qh_n = 0, since Qh_n = 0. The statements of the lemma easily follow from the above facts. Indeed,

Qs = Q(Pf + h + g) = QPf = Pf,

since Pf ∈ H_0 ⊂ S_0, and Equation (21) is proved. Equation (22) follows from Eq. (26). Finally, let now f_n = Pf + h − h_n and g_n = g + h_n. Clearly, f_n + g_n = f + g = s, f_n ∈ H and g_n ∈ B, and moreover Eq. (23) follows from Eq. (27).

We are now ready to prove the main theorem of this section.

Proof [Theorem 6] First of all we note the following facts. Let f ∈ H, g ∈ B and s = f + g ∈ S. By Eq. (22),

I[s] + λ‖Qs‖_S^2 = I[f + g] + λ‖Pf‖_H^2.   (28)

Let (f_n, g_n) ∈ H × B be as in Lemma 7; then

I[f + g] + λ‖Pf‖_H^2 = lim_{n→∞} ( I[f_n + g_n] + λ‖f_n‖_H^2 ).

From the above equalities it follows that

I[s] + λ‖Qs‖_S^2 = lim_{n→∞} ( I[f_n + g_n] + λ‖f_n‖_H^2 ).   (29)

We can now prove the first part of the theorem. Assume that (f^λ, g^λ) ∈ H × B is a minimizer of I[f + g] + λ‖f‖_H^2 and let s^λ = f^λ + g^λ. From Eq. (29) and the definition of minimizer, one has that, for all s ∈ S,

I[s] + λ‖Qs‖_S^2 ≥ I[f^λ + g^λ] + λ‖f^λ‖_H^2.   (30)

In particular, with the choice s = s^λ, by means of Eq. (22), one has that ‖Qs^λ‖_S = ‖Pf^λ‖_H ≤ ‖f^λ‖_H and, hence, that Qs^λ = Pf^λ = f^λ. Therefore, it follows that

I[s] + λ‖Qs‖_S^2 ≥ I[s^λ] + λ‖Qs^λ‖_S^2,

that is, s^λ is a minimizer of I[s] + λ‖Qs‖_S^2.

Before proving the second part of the theorem we note that the following inequality follows as a simple consequence of the definition of projection:

I[s] + λ‖Qs‖_S^2 = I[f + g] + λ‖Pf‖_H^2 ≤ I[f + g] + λ‖f‖_H^2.   (31)

Assume now that s^λ ∈ S is a minimizer of I[s] + λ‖Qs‖_S^2. Let f^λ = Qs^λ and g^λ = s^λ − f^λ; then, by Eq. (31) and Eq. (22), it follows that

I[f^λ + g^λ] + λ‖f^λ‖_H^2 ≤ inf_{(f,g) ∈ H×B} { I[f + g] + λ‖f‖_H^2 }.

However, using Eq. (29) with s = f^λ + g^λ, one has that

I[f^λ + g^λ] + λ‖f^λ‖_H^2 ≥ inf_{(f,g) ∈ H×B} { I[f + g] + λ‖f‖_H^2 }.

So I[f^λ + g^λ] + λ‖f^λ‖_H^2 is the infimum of I[f + g] + λ‖f‖_H^2 on H × B. Clearly, if g^λ ∈ B, it follows that (f^λ, g^λ) is a minimizer of I[f + g] + λ‖f‖_H^2.

5.4 A Counterexample

The following example shows that in some pathological frameworks the minimization on H × B is not equivalent to the one on S = H + B.

Example 1 Let H = ℓ² = { f = (f_n)_{n∈N} | Σ_n f_n^2 < +∞ }. The space ℓ² is a RKHS on N with respect to the kernel K(n, m) = δ_{n,m}. Let B = { f ∈ ℓ² | Σ_n n^2 f_n^2 < +∞ } with the scalar product

⟨f, g⟩_B = Σ_n n^2 f_n g_n.

The space B is a RKHS with respect to the kernel K_B(n, m) = (1/n^2) δ_{n,m}. Clearly, B ⊂ H, so that H ∩ B = B, which is not closed in H. Since B is dense in H, P = 0 and, by Lemma 7, Q = 0. Let V be the squared loss function and choose h = (h_n)_{n∈N} ∈ H such that h ∉ B. Let ρ(n, y) = δ(y − h_n), so that I[s] = ‖s − h‖^2 (a weighted ℓ² distance); then

I[s] + λ‖Qs‖_S^2 = ‖s − h‖^2,

and the minimizer is s^λ = h. Moreover, by our theorem, one has that

inf_{f∈H, g∈B} { I[f + g] + λ‖f‖_H^2 } = I[s^λ] + λ‖Qs^λ‖_S^2 = 0.

If (f^λ, g^λ) ∈ H × B were a minimizer, then f^λ = 0 and, hence, g^λ = h, but this is impossible since h ∉ B.

6. Existence and Uniqueness

We now discuss existence and uniqueness of the regularized solution in S. Before stating and proving the main results we summarize our findings and show that if the offset space is empty both existence and uniqueness are easily obtained. Our analysis extends existence to all cases of interest under some weak assumptions on the kernel and the loss function, for both regression and classification. Uniqueness depends critically on the convexity assumption. For strictly convex loss functions we prove that the solution is unique if and only if the offset space satisfies suitable conditions, fulfilled in the case of constant offsets. For loss functions which are not strictly convex we limit our attention to the hinge loss and show that the solution is unique unless some particular conditions on the number and location of the support vectors are met. In Burges and Crisp (2000, 2003) similar results were obtained considering the dual formulation of the minimization problem.

If the offset space is empty, strict convexity and coerciveness of the penalty term trivially imply both existence and uniqueness. Indeed, we have the following proposition.

Proposition 8 Given λ > 0, there exists a unique solution of the problem

min_{f∈H} ( I[f] + λ‖f‖_H^2 ).

Proof The functional I[f] + λ‖f‖_H^2 is strictly convex and continuous. Moreover, I[f] + λ‖f‖_H^2 ≥ λ‖f‖_H^2 → +∞ as ‖f‖_H goes to +∞. From item 4 of Proposition 14 both existence and uniqueness follow.

6.1 Existence

We now consider existence. If B is not trivial, there are no general results (see Wahba, 1990, for a discussion on this subject). However, if B is the set of constant functions, we derive existence of the solution in two different settings. The first proposition holds only for classification under the assumption that the loss function V goes to infinity when y f(x) goes to −∞ (see Condition 1 of Proposition 9 below). Similar results were obtained in Steinwart (2002). We let ν be the marginal measure on X associated with ρ and supp ν its support.

Proposition 9 Assume that the following conditions hold:

1. lim_{w→−∞} V(1, w) = +∞ and lim_{w→+∞} V(−1, w) = +∞;
2. there is C > 0 such that K(x, x) ≤ C for all x ∈ supp ν;
3. ρ(X × {1}) > 0 and ρ(X × {−1}) > 0.

Then there is at least one solution of the problem

min_{s∈S} ( I[s] + λ‖Qs‖_S^2 ),

where S = H + R.

We observe that Assumption 2 is satisfied if X is compact and K is continuous. Assumption 3 has a very natural interpretation in the discrete setting, where it simply amounts to having one example for each class. This condition is needed since Assumption 1 does not require that V goes to +∞ when y f(x) goes to +∞. A typical example of a loss function satisfying Assumption 1 is the hinge loss.

The second result holds both for regression and classification, but it requires the loss function to go to infinity when f(x) goes to ±∞, uniformly in y (compare Assumption 1 of Proposition 10 and Assumption 1 of Proposition 9).

Proposition 10 Assume that the following conditions hold:

1. lim_{w→±∞} ( inf_{y∈Y} V(y, w) ) = +∞;
2. there is C > 0 such that K(x, x) ≤ C for all x ∈ supp ν.

Then there is at least one solution of the problem

min_{s∈S} ( I[s] + λ‖Qs‖_S^2 ),

where S = H + R.

We observe that for classification with symmetric loss functions, such as the squared loss function, this proposition gives a sharper result than Proposition 9. We now prove Proposition 9 and omit the proof of Proposition 10 since it is essentially the same.

Proof [of Proposition 9] The idea of the proof is to show that the functional we have to minimize goes to +∞ when ‖s‖_S goes to +∞. With this aim, let

α = min{ ρ(X × {1}), ρ(X × {−1}) }.

By Assumption 3, α > 0. For a fixed M > 0, we are looking for R > 0 such that, for all s ∈ S with ‖s‖_S ≥ R,

I[s] + λ‖Qs‖_S^2 ≥ M.

Due to Assumption 1, there is r > 0 such that, for all w ≤ −r, V(1, w) ≥ M/α and, for all w ≥ r, V(−1, w) ≥ M/α. We now let

R = max{ 2(1 + C) √(M/λ), 2r }

and choose s ∈ S with ‖s‖_S ≥ R (without loss of generality we take C ≥ 1). If ‖Qs‖_S = ‖Qs‖_H ≥ R/(2(1 + C)), then

I[s] + λ‖Qs‖_S^2 ≥ λ‖Qs‖_S^2 ≥ λ ( R/(2(1 + C)) )^2 ≥ M,

since R/(2(1 + C)) ≥ √(M/λ). If ‖Qs‖_S ≤ R/(2(1 + C)), let b = s − Qs ∈ R; then

|b| = ‖s − Qs‖_S ≥ ‖s‖_S − ‖Qs‖_S ≥ R − R/(2(1 + C)) = R (2C + 1)/(2C + 2).

Assume, for example, that b > 0. For all x ∈ supp ν,

s(x) = ⟨Qs, K_x⟩_H + b ≥ b − ‖Qs‖_H ‖K_x‖_H ≥ R (2C + 1)/(2C + 2) − R √C/(2(1 + C)) ≥ R (C + 1)/(2C + 2) = R/2 ≥ r,

since √C ≤ C and R ≥ 2r. By the definition of r, one has that, for all x ∈ supp ν,

V(−1, s(x)) ≥ M/α.

Integrating both sides, we find

∫_{X×{−1}} V(−1, s(x)) dρ(x, y) ≥ (M/α) ρ(X × {−1}) ≥ M,

from which it follows that

I[s] + λ‖Qs‖_S^2 ≥ M.

The same proof holds when b < 0, replacing the integration on X × {−1} with the integration on X × {1}. Since M is arbitrary, we have that

I[s] + λ‖Qs‖_S^2 ≥ λ‖Qs‖_S^2 → +∞   as ‖s‖_S → +∞.

Since the functional is continuous, from item 4 of Proposition 14 the existence of the minimizer follows.

6.2 Uniqueness

The first proposition completely characterizes uniqueness for strictly convex loss functions.

Proposition 11 Let s^λ be a solution of the problem

min_{s∈S} ( I[s] + λ‖Qs‖_S^2 ).

1. If s' is another solution, then Qs' = Qs^λ.
2. If V(y, ·) is strictly convex for all y ∈ Y, then all the minimizers are of the form s^λ + g, with g ∈ S such that Qg = 0 and g(x) = 0 for ν-almost all x ∈ X.

Let us comment on this proposition before providing the proof. We recall that a solution s^λ is the sum of two terms: f^λ = Qs^λ, which is orthogonal to B, and g^λ = s^λ − f^λ. The uniqueness of f^λ (item 1) is due to the strict convexity of the penalty term. Item 2 states the general conditions that should be satisfied by offset functions to obtain uniqueness of s^λ: in the discrete setting one has uniqueness if and only if the condition g(x_i) = 0 for all i implies that g is equal to zero. Clearly, if B is the space of constant functions, uniqueness is ensured. We now give the proof of the proposition.

Proof [of Proposition 11]

1. Let s' be another minimizer and assume that Qs^λ ≠ Qs'. Then, by the strict convexity of ‖·‖_S^2, one has that, for all t ∈ ]0, 1[,

‖(1−t)Qs^λ + tQs'‖_S^2 < (1−t)‖Qs^λ‖_S^2 + t‖Qs'‖_S^2.

Since I[s] is convex, one has that

I[(1−t)s^λ + ts'] ≤ (1−t)I[s^λ] + tI[s'].

From the above two inequalities we find

I[(1−t)s^λ + ts'] + λ‖Q((1−t)s^λ + ts')‖_S^2 < (1−t)( I[s^λ] + λ‖Qs^λ‖_S^2 ) + t( I[s'] + λ‖Qs'‖_S^2 ) = min_{s∈S} ( I[s] + λ‖Qs‖_S^2 ).

Since this is impossible, it follows that Qs^λ = Qs'.

2. Let s' = s^λ + g with g as in the statement of item 2. By straightforward computation we have that s' is a minimizer. It is left to show that the minimizers are only the functions written in the above form. From item 1 we have that Qg = 0. Let U be the measurable set

U = { x ∈ X | g(x) ≠ 0 } = { x ∈ X | s'(x) ≠ s^λ(x) }.

By contradiction, let us assume that ν(U) > 0 and, hence, ρ(U × Y) > 0. Fix t ∈ ]0, 1[. Since V(y, ·) is strictly convex, for all (x, y) ∈ U × Y one has that

V(y, (1−t)s^λ(x) + ts'(x)) < (1−t)V(y, s^λ(x)) + tV(y, s'(x)).

Therefore, by integration,

∫_{U×Y} V(y, (1−t)s^λ(x) + ts'(x)) dρ(x, y) < (1−t) ∫_{U×Y} V(y, s^λ(x)) dρ(x, y) + t ∫_{U×Y} V(y, s'(x)) dρ(x, y).

On the complement of U × Y we have V(y, s^λ(x)) = V(y, s'(x)), so that

I[(1−t)s^λ + ts'] < (1−t)I[s^λ] + tI[s'].

By the same line of reasoning as in item 1, one finds a contradiction. It follows that ν(U) = 0, that is, g(x) = 0 for ν-almost all x ∈ X.

Two important examples of convex loss functions which are not strictly convex are the hinge and the ε-insensitive loss. The next proposition deals with the hinge loss, though a similar result can also be derived for the ε-insensitive loss, see Burges and Crisp (2000). For the sake of simplicity we develop our result in the discrete setting for the case of constant offset functions. In this case uniqueness of the solution is expressed as a condition on the number of support vectors of the two classes. Similar but a little more involved conditions can be found considering the continuous setting.

Proposition 12 Let Y = {±1}, V(y, w) = |1 − yw|_+ and B = R. Let s^λ be a solution of

min_{s∈S} ( (1/ℓ) Σ_{i=1}^ℓ V(y_i, s(x_i)) + λ‖Qs‖_S^2 ),

and define

I_+ = { i | y_i = 1, s^λ(x_i) < 1 }
I_− = { i | y_i = −1, s^λ(x_i) > −1 }
B_+ = { i | y_i = 1, s^λ(x_i) = 1 }
B_− = { i | y_i = −1, s^λ(x_i) = −1 }.

The solution is unique if and only if neither

#I_+ = #I_− + #B_−   (32)

nor

#I_− = #I_+ + #B_+   (33)

holds, where # denotes set cardinality.

Proof Assume that s' is another solution. From item 1 of Proposition 11, we have that Qs^λ = Qs' and s' = s^λ + b. Since both functions are minimizers, one concludes that

Σ_{i=1}^ℓ |1 − y_i s^λ(x_i)|_+ = Σ_{i=1}^ℓ |1 − y_i s'(x_i)|_+.   (34)

We notice that if y w_1 < 1 and y w_2 > 1, then

V(y, (1−t)w_1 + t w_2) < (1−t)V(y, w_1) + t V(y, w_2).

Reasoning as in the proof of the previous proposition, one has that, for all i ∈ I_+ ∪ I_−,

y_i s'(x_i) ≤ 1,

and, for all i ∉ (I_+ ∪ I_− ∪ B_+ ∪ B_−),

y_i s'(x_i) ≥ 1.

Using the above two equations, it follows that equality (34) becomes

Σ_{i ∈ I_+ ∪ I_−} (1 − y_i s^λ(x_i)) = Σ_{i ∈ I_+ ∪ I_−} (1 − y_i s'(x_i)) + Σ_{i ∈ B_+ ∪ B_−} |−b y_i|_+

(if an index set is empty, we let the corresponding sum be equal to 0). The above equation is equivalent to

Σ_{i ∈ I_+ ∪ I_−} b y_i = Σ_{i ∈ B_+ ∪ B_−} |−b y_i|_+,

which has a non trivial solution if and only if both the following conditions are true:

1. if b > 0, then Σ_{i ∈ I_+ ∪ I_−} y_i = −Σ_{i ∈ B_−} y_i (that is, Eq. (32) holds);
2. if b < 0, then Σ_{i ∈ I_+ ∪ I_−} y_i = −Σ_{i ∈ B_+} y_i (that is, Eq. (33) holds).

Now, if neither Eq. (32) nor Eq. (33) holds, then b = 0 and s^λ is unique. Conversely, assume for example that Eq. (32) holds. It is simple to check that there is b > 0 such that, for all i ∈ I_+ ∪ I_−,

y_i ( s^λ(x_i) + b ) ≤ 1,

and, for all i ∉ (I_+ ∪ I_− ∪ B_+ ∪ B_−),

y_i ( s^λ(x_i) + b ) ≥ 1.

Finally, by direct computation one has that

I[s^λ] = I[s^λ + b].

If the solution is not unique, the solution family is parameterized as s^λ + b, where b runs in a closed, not necessarily bounded interval. However, if there is at least one example for each class, b lies in a bounded interval [b_−, b_+] and one can easily show that:

1. for the solution with b = b_−, Eq. (32) holds;
2. for the solution with b = b_+, Eq. (33) holds;
3. for the solutions with b_− < b < b_+, both Eqs. (32) and (33) hold, from which it follows that #I_+ = #I_− and #B_+ = #B_− = 0.
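The following helper (a sketch added for illustration, not part of the paper) counts the index sets of Proposition 12 from the values s^λ(x_i) of a trained hinge-loss classifier and tests the two degenerate count conditions; the numerical tolerance used to decide membership in B_± is an assumption of the example, and in practice the counts depend on that tolerance.

```python
import numpy as np

def offset_uniqueness_check(y, s_vals, tol=1e-6):
    """Count the index sets of Proposition 12 and test conditions (32)-(33).

    y      : array of labels in {-1, +1}
    s_vals : array with s_lambda(x_i) at the training points
    Returns (unique, counts): `unique` is False when either count condition
    holds, in which case the constant offset is not uniquely determined.
    """
    y = np.asarray(y); s = np.asarray(s_vals)
    m = y * s                                              # margins y_i * s_lambda(x_i)
    I_plus  = int(np.sum((y == 1)  & (m < 1 - tol)))       # positives inside the margin
    I_minus = int(np.sum((y == -1) & (m < 1 - tol)))       # negatives inside the margin
    B_plus  = int(np.sum((y == 1)  & (np.abs(m - 1) <= tol)))   # positives exactly on the margin
    B_minus = int(np.sum((y == -1) & (np.abs(m - 1) <= tol)))   # negatives exactly on the margin
    eq32 = (I_plus == I_minus + B_minus)                   # allows shifting the offset upward
    eq33 = (I_minus == I_plus + B_plus)                    # allows shifting the offset downward
    counts = dict(I_plus=I_plus, I_minus=I_minus, B_plus=B_plus, B_minus=B_minus)
    return (not eq32) and (not eq33), counts
```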

7. Discrete Tikhonov Regularization

We now specialize our results to the case in which the probability measure is the empirical distribution ρ_S and B is the space of constant functions (K_B = 1), and we discuss in detail Support Vector Machines for classification. We start by recalling that from item 2 of Proposition 14 it follows that the left and right derivatives of V(y, ·) always exist and

(∂V)(y, w) = [V'_−(y, w), V'_+(y, w)].

Corollary 13 Let S = H + R and Q the projection onto { s ∈ S | ⟨s, 1⟩_S = 0 }. Given λ > 0, let f^λ ∈ H and b^λ ∈ R and define s^λ = f^λ + b^λ ∈ S; then

(f^λ, b^λ) ∈ argmin_{f∈H, b∈R} { (1/ℓ) Σ_i V(y_i, f(x_i) + b) + λ‖f‖_H^2 }

if and only if

s^λ ∈ argmin_{s∈S} { (1/ℓ) Σ_i V(y_i, s(x_i)) + λ‖Qs‖_S^2 },   f^λ = Qs^λ,

if and only if there are α_1, ..., α_ℓ ∈ R such that

f^λ = Σ_{i=1}^ℓ α_i K_{x_i} = Σ_{i=1}^ℓ α_i (K_{x_i} + 1),

−(1/(2λℓ)) V'_+(y_i, f^λ(x_i) + b^λ) ≤ α_i ≤ −(1/(2λℓ)) V'_−(y_i, f^λ(x_i) + b^λ),

Σ_{i=1}^ℓ α_i = 0.

We notice two facts. First, α_i can be zero only if 0 ∈ (∂V)(y_i, f^λ(x_i) + b^λ), that is, only if f^λ(x_i) + b^λ is a minimizer of V(y_i, ·). Therefore, a necessary condition for obtaining sparsity is a plateau in the loss function. A quantitative discussion on this topic can be found in Steinwart (2003). Second, if V'_− and V'_+ are bounded by a constant M > 0, one has that |α_i| ≤ M/(2λℓ), that is, a sufficient condition for box constraints on the coefficients.

In the rest of this section we consider Support Vector Machines for classification, showing that through our analysis the solution is completely characterized in the primal formulation. A simple calculation for the hinge loss shows that

[V'_−(y, w), V'_+(y, w)] = {−y}                        for yw < 1,
                           [min{−y, 0}, max{0, −y}]    for yw = 1,
                           {0}                         for yw > 1.   (35)
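The short sketch below (added here as an illustration, not part of the paper) checks the conditions of Corollary 13 for the hinge loss on a trained classifier. It assumes scikit-learn is available and uses its SVC solver with a precomputed kernel purely as an off-the-shelf way to obtain a minimizer; the mapping C = 1/(2λℓ), the data, and the tolerances are assumptions of the example.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
ell, lam = 80, 0.01
X = rng.standard_normal((ell, 2))
y = np.where(X[:, 0] + 0.3 * rng.standard_normal(ell) > 0, 1, -1)

K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # Gaussian Gram matrix
C = 1.0 / (2 * lam * ell)   # (1/ell)*sum hinge + lam*||f||^2 matches SVC with this C

svc = SVC(C=C, kernel="precomputed").fit(K, y)

# Recover the coefficients alpha_i of Corollary 13: s(x) = sum_i alpha_i K(x, x_i) + b.
alpha = np.zeros(ell)
alpha[svc.support_] = svc.dual_coef_[0]   # equals y_i times the dual variable, zero off the support
b = svc.intercept_[0]
margins = y * (K @ alpha + b)
tol = 1e-2

checks = {
    "sum_i alpha_i = 0":            abs(alpha.sum()) < tol,
    "0 <= y_i alpha_i <= C":        bool(np.all((y * alpha > -tol) & (y * alpha < C + tol))),
    "alpha_i = 0 when margin > 1":  bool(np.all(np.abs(alpha[margins > 1 + tol]) < tol)),
    "y_i alpha_i = C when margin < 1":
        bool(np.all(y[margins < 1 - tol] * alpha[margins < 1 - tol] > C - tol)),
}
print(checks)
```

Up to the solver tolerance, the printed checks reproduce the box constraints and the sparsity pattern that Eq. (35) forces on the coefficients.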

To be consistent with the notation used in the literature, we let C = 1/(2λℓ) and factorize the labels y_i from the coefficients α_i. Then, according to the above corollary, the solution of the SVM algorithm is given by

s^λ = Σ_{i=1}^ℓ α_i y_i K_{x_i} + b^λ,

where the set (α_1, ..., α_ℓ, b^λ) solves the following algebraic system of inequalities:

0 ≤ α_i ≤ C   if   y_i ( Σ_{j=1}^ℓ α_j y_j K(x_i, x_j) + b^λ ) = 1,
α_i = 0        if   y_i ( Σ_{j=1}^ℓ α_j y_j K(x_i, x_j) + b^λ ) > 1,   (36)
α_i = C        if   y_i ( Σ_{j=1}^ℓ α_j y_j K(x_i, x_j) + b^λ ) < 1,
Σ_i α_i y_i = 0.

Interestingly, the above inequalities, which fully characterize the support vectors associated with the solution, are usually obtained as the Kuhn-Tucker conditions of the dual QP optimization problem (Vapnik, 1988). Looking at Eqs. (35)-(36), it is immediate to see that the box constraints (0 ≤ α_i ≤ C) are due to the linearity of V(yf(x)) for yf(x) < 1, whereas sparsity (α_i = 0) follows from the constancy of V(yf(x)) for yf(x) > 1.

8. Conclusion

In this paper we study some properties of learning functionals derived from Tikhonov regularization. We develop our analysis in a continuous setting and use tools from convex analysis in infinite dimensional spaces to quantitatively characterize the explicit form of the regularized solution for both regression and classification. We also address the cases with and without the offset term within the same unifying framework. We show that the presence of an offset term is equivalent to solving a standard problem of regularization in a Reproducing Kernel Hilbert Space in which the penalty term is given by a seminorm. Finally, we discuss issues of existence and uniqueness of the solution and specialize our results to the discrete setting. Current work aims at extending these results to vector-valued functions (Micchelli and Pontil, 2003) and exploring the possible use of offset functions to incorporate invariances (Girosi and Chan, 1995).

Acknowledgments

We thank the anonymous referees for suggestions leading to an improved version of the paper. A. Caponnetto is supported by a PRIN fellowship within the project "Inverse problems in medical imaging". This research has been partially funded by the INFM Project MAIA,

the FIRB Project ASTAA and the IST Programme of the European Community, under the PASCAL Network of Excellence (IST).

Appendix A. Convex Functions in Infinite Dimensional Spaces

The proof of Theorem 2 is based on the properties of convex functions defined on infinite dimensional spaces. In particular, we use the notion of subgradient, which extends the notion of derivative to convex non-differentiable functions. In this appendix we collect the results we need. For details see the book by Ekeland and Turnbull (1983) and also Ekeland and Temam (1974).

Let H be a Banach space and H* its dual. A function F : H → R ∪ {+∞} is convex if

F(tv + (1−t)w) ≤ tF(v) + (1−t)F(w)

for all v, w ∈ H and t ∈ [0, 1] (if the strict inequality holds for t ∈ ]0, 1[, F is called strictly convex). Let v_0 ∈ H be such that F(v_0) < +∞. The subgradient of F at the point v_0 ∈ H is the subset of H* given by

∂F(v_0) = { w ∈ H* | F(v) ≥ F(v_0) + ⟨w, v − v_0⟩  ∀v ∈ H },   (37)

where ⟨·,·⟩ is the pairing between H* and H. If F(v_0) = +∞, we let ∂F(v_0) = ∅. In the following proposition we summarize the main properties of the subgradient we need.

Proposition 14 The following facts hold:

1. If F is differentiable at v_0, the subgradient reduces to the usual gradient, ∂F(v_0) = {F'(v_0)}.
2. If F is defined on R and F(v_0) < +∞, then F admits left and right derivatives and ∂F(v_0) = [F'_−(v_0), F'_+(v_0)].
3. Assume that F ≢ +∞. A point v_0 is a minimizer of F if and only if 0 ∈ ∂F(v_0).
4. If F is continuous and lim_{‖v‖_H → +∞} F(v) = +∞, then F has a minimizer. If F is strictly convex, the minimizer is unique.
5. Let G be another convex function on H. Assume that there is v_0 ∈ H such that F and G are continuous and finite at v_0. Let a, b ≥ 0; then aF + bG is convex and, for all v ∈ H,

   ∂(aF + bG)(v) = a(∂F)(v) + b(∂G)(v).

6. Let H' be another Banach space and J be a continuous linear operator from H' into H. Assume that there is v_0' ∈ H' such that F is continuous and finite at J v_0'. For all v' ∈ H',

   ∂(F ∘ J)(v') = J*(∂F)(J v'),

   where J* : H* → H'* is the adjoint of J, defined by ⟨v*, J v'⟩_H = ⟨J* v*, v'⟩_{H'} for all v' ∈ H' and v* ∈ H*.
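As a one-dimensional numerical illustration of items 2-4 (a sketch added here, not part of the paper), the code below takes a hinge-plus-quadratic function, locates its minimizer on a grid, estimates the one-sided derivatives by finite differences, and checks that 0 lies in the resulting subgradient interval; the particular function and step sizes are assumptions of the example.

```python
import numpy as np

lam = 0.3
F = lambda v: np.maximum(0.0, 1.0 - v) + lam * v ** 2   # convex, non-differentiable at v = 1

# Item 4: F is continuous and coercive, so a minimizer exists; locate it on a fine grid.
grid = np.linspace(-3, 3, 60001)
v0 = grid[np.argmin(F(grid))]

# Item 2: on R the subgradient is the interval [F'_-(v0), F'_+(v0)];
# estimate the one-sided derivatives by finite differences.
h = 1e-6
left  = (F(v0) - F(v0 - h)) / h
right = (F(v0 + h) - F(v0)) / h

# Item 3: v0 is a minimizer if and only if 0 lies in the subgradient interval.
print(f"v0 ~ {v0:.4f}, subgradient ~ [{left:.3f}, {right:.3f}], contains 0: {left <= 0 <= right}")
```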


More information

ANALOG OF HEAT EQUATION FOR GAUSSIAN MEASURE OF A BALL IN HILBERT SPACE GYULA PAP

ANALOG OF HEAT EQUATION FOR GAUSSIAN MEASURE OF A BALL IN HILBERT SPACE GYULA PAP ANALOG OF HEAT EQUATION FOR GAUSSIAN MEASURE OF A BALL IN HILBERT SPACE GYULA PAP ABSTRACT. If µ is a Gaussian measure on a Hibert space with mean a and covariance operator T, and r is a} fixed positive

More information

Completion. is dense in H. If V is complete, then U(V) = H.

Completion. is dense in H. If V is complete, then U(V) = H. Competion Theorem 1 (Competion) If ( V V ) is any inner product space then there exists a Hibert space ( H H ) and a map U : V H such that (i) U is 1 1 (ii) U is inear (iii) UxUy H xy V for a xy V (iv)

More information

MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES

MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES Separation of variabes is a method to sove certain PDEs which have a warped product structure. First, on R n, a inear PDE of order m is

More information

2M2. Fourier Series Prof Bill Lionheart

2M2. Fourier Series Prof Bill Lionheart M. Fourier Series Prof Bi Lionheart 1. The Fourier series of the periodic function f(x) with period has the form f(x) = a 0 + ( a n cos πnx + b n sin πnx ). Here the rea numbers a n, b n are caed the Fourier

More information

Another Class of Admissible Perturbations of Special Expressions

Another Class of Admissible Perturbations of Special Expressions Int. Journa of Math. Anaysis, Vo. 8, 014, no. 1, 1-8 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.1988/ijma.014.31187 Another Cass of Admissibe Perturbations of Specia Expressions Jerico B. Bacani

More information

An approximate method for solving the inverse scattering problem with fixed-energy data

An approximate method for solving the inverse scattering problem with fixed-energy data J. Inv. I-Posed Probems, Vo. 7, No. 6, pp. 561 571 (1999) c VSP 1999 An approximate method for soving the inverse scattering probem with fixed-energy data A. G. Ramm and W. Scheid Received May 12, 1999

More information

Numerical methods for elliptic partial differential equations Arnold Reusken

Numerical methods for elliptic partial differential equations Arnold Reusken Numerica methods for eiptic partia differentia equations Arnod Reusken Preface This is a book on the numerica approximation of partia differentia equations. On the next page we give an overview of the

More information

SVM: Terminology 1(6) SVM: Terminology 2(6)

SVM: Terminology 1(6) SVM: Terminology 2(6) Andrew Kusiak Inteigent Systems Laboratory 39 Seamans Center he University of Iowa Iowa City, IA 54-57 SVM he maxima margin cassifier is simiar to the perceptron: It aso assumes that the data points are

More information

JENSEN S OPERATOR INEQUALITY FOR FUNCTIONS OF SEVERAL VARIABLES

JENSEN S OPERATOR INEQUALITY FOR FUNCTIONS OF SEVERAL VARIABLES PROCEEDINGS OF THE AMERICAN MATHEMATICAL SOCIETY Voume 128, Number 7, Pages 2075 2084 S 0002-99390005371-5 Artice eectronicay pubished on February 16, 2000 JENSEN S OPERATOR INEQUALITY FOR FUNCTIONS OF

More information

MARKOV CHAINS AND MARKOV DECISION THEORY. Contents

MARKOV CHAINS AND MARKOV DECISION THEORY. Contents MARKOV CHAINS AND MARKOV DECISION THEORY ARINDRIMA DATTA Abstract. In this paper, we begin with a forma introduction to probabiity and expain the concept of random variabes and stochastic processes. After

More information

Akaike Information Criterion for ANOVA Model with a Simple Order Restriction

Akaike Information Criterion for ANOVA Model with a Simple Order Restriction Akaike Information Criterion for ANOVA Mode with a Simpe Order Restriction Yu Inatsu * Department of Mathematics, Graduate Schoo of Science, Hiroshima University ABSTRACT In this paper, we consider Akaike

More information

A NOTE ON QUASI-STATIONARY DISTRIBUTIONS OF BIRTH-DEATH PROCESSES AND THE SIS LOGISTIC EPIDEMIC

A NOTE ON QUASI-STATIONARY DISTRIBUTIONS OF BIRTH-DEATH PROCESSES AND THE SIS LOGISTIC EPIDEMIC (January 8, 2003) A NOTE ON QUASI-STATIONARY DISTRIBUTIONS OF BIRTH-DEATH PROCESSES AND THE SIS LOGISTIC EPIDEMIC DAMIAN CLANCY, University of Liverpoo PHILIP K. POLLETT, University of Queensand Abstract

More information

NOISE-INDUCED STABILIZATION OF STOCHASTIC DIFFERENTIAL EQUATIONS

NOISE-INDUCED STABILIZATION OF STOCHASTIC DIFFERENTIAL EQUATIONS NOISE-INDUCED STABILIZATION OF STOCHASTIC DIFFERENTIAL EQUATIONS TONY ALLEN, EMILY GEBHARDT, AND ADAM KLUBALL 3 ADVISOR: DR. TIFFANY KOLBA 4 Abstract. The phenomenon of noise-induced stabiization occurs

More information

Partial permutation decoding for MacDonald codes

Partial permutation decoding for MacDonald codes Partia permutation decoding for MacDonad codes J.D. Key Department of Mathematics and Appied Mathematics University of the Western Cape 7535 Bevie, South Africa P. Seneviratne Department of Mathematics

More information

ORTHOGONAL MULTI-WAVELETS FROM MATRIX FACTORIZATION

ORTHOGONAL MULTI-WAVELETS FROM MATRIX FACTORIZATION J. Korean Math. Soc. 46 2009, No. 2, pp. 281 294 ORHOGONAL MLI-WAVELES FROM MARIX FACORIZAION Hongying Xiao Abstract. Accuracy of the scaing function is very crucia in waveet theory, or correspondingy,

More information

BASIC NOTIONS AND RESULTS IN TOPOLOGY. 1. Metric spaces. Sets with finite diameter are called bounded sets. For x X and r > 0 the set

BASIC NOTIONS AND RESULTS IN TOPOLOGY. 1. Metric spaces. Sets with finite diameter are called bounded sets. For x X and r > 0 the set BASIC NOTIONS AND RESULTS IN TOPOLOGY 1. Metric spaces A metric on a set X is a map d : X X R + with the properties: d(x, y) 0 and d(x, y) = 0 x = y, d(x, y) = d(y, x), d(x, y) d(x, z) + d(z, y), for a

More information

Are Loss Functions All the Same?

Are Loss Functions All the Same? Are Loss Functions All the Same? L. Rosasco E. De Vito A. Caponnetto M. Piana A. Verri November 11, 2003 Abstract In this paper we investigate the impact of choosing different loss functions from the viewpoint

More information

Research Article Numerical Range of Two Operators in Semi-Inner Product Spaces

Research Article Numerical Range of Two Operators in Semi-Inner Product Spaces Abstract and Appied Anaysis Voume 01, Artice ID 846396, 13 pages doi:10.1155/01/846396 Research Artice Numerica Range of Two Operators in Semi-Inner Product Spaces N. K. Sahu, 1 C. Nahak, 1 and S. Nanda

More information

Smoothness equivalence properties of univariate subdivision schemes and their projection analogues

Smoothness equivalence properties of univariate subdivision schemes and their projection analogues Numerische Mathematik manuscript No. (wi be inserted by the editor) Smoothness equivaence properties of univariate subdivision schemes and their projection anaogues Phiipp Grohs TU Graz Institute of Geometry

More information

A Brief Introduction to Markov Chains and Hidden Markov Models

A Brief Introduction to Markov Chains and Hidden Markov Models A Brief Introduction to Markov Chains and Hidden Markov Modes Aen B MacKenzie Notes for December 1, 3, &8, 2015 Discrete-Time Markov Chains You may reca that when we first introduced random processes,

More information

Rate-Distortion Theory of Finite Point Processes

Rate-Distortion Theory of Finite Point Processes Rate-Distortion Theory of Finite Point Processes Günther Koiander, Dominic Schuhmacher, and Franz Hawatsch, Feow, IEEE Abstract We study the compression of data in the case where the usefu information

More information

Homework 5 Solutions

Homework 5 Solutions Stat 310B/Math 230B Theory of Probabiity Homework 5 Soutions Andrea Montanari Due on 2/19/2014 Exercise [5.3.20] 1. We caim that n 2 [ E[h F n ] = 2 n i=1 A i,n h(u)du ] I Ai,n (t). (1) Indeed, integrabiity

More information

The Group Structure on a Smooth Tropical Cubic

The Group Structure on a Smooth Tropical Cubic The Group Structure on a Smooth Tropica Cubic Ethan Lake Apri 20, 2015 Abstract Just as in in cassica agebraic geometry, it is possibe to define a group aw on a smooth tropica cubic curve. In this note,

More information

Higher dimensional PDEs and multidimensional eigenvalue problems

Higher dimensional PDEs and multidimensional eigenvalue problems Higher dimensiona PEs and mutidimensiona eigenvaue probems 1 Probems with three independent variabes Consider the prototypica equations u t = u (iffusion) u tt = u (W ave) u zz = u (Lapace) where u = u

More information

A proposed nonparametric mixture density estimation using B-spline functions

A proposed nonparametric mixture density estimation using B-spline functions A proposed nonparametric mixture density estimation using B-spine functions Atizez Hadrich a,b, Mourad Zribi a, Afif Masmoudi b a Laboratoire d Informatique Signa et Image de a Côte d Opae (LISIC-EA 4491),

More information

arxiv: v1 [math.co] 17 Dec 2018

arxiv: v1 [math.co] 17 Dec 2018 On the Extrema Maximum Agreement Subtree Probem arxiv:1812.06951v1 [math.o] 17 Dec 2018 Aexey Markin Department of omputer Science, Iowa State University, USA amarkin@iastate.edu Abstract Given two phyogenetic

More information

Theory of Generalized k-difference Operator and Its Application in Number Theory

Theory of Generalized k-difference Operator and Its Application in Number Theory Internationa Journa of Mathematica Anaysis Vo. 9, 2015, no. 19, 955-964 HIKARI Ltd, www.m-hiari.com http://dx.doi.org/10.12988/ijma.2015.5389 Theory of Generaized -Difference Operator and Its Appication

More information

The Construction of a Pfaff System with Arbitrary Piecewise Continuous Characteristic Power-Law Functions

The Construction of a Pfaff System with Arbitrary Piecewise Continuous Characteristic Power-Law Functions Differentia Equations, Vo. 41, No. 2, 2005, pp. 184 194. Transated from Differentsia nye Uravneniya, Vo. 41, No. 2, 2005, pp. 177 185. Origina Russian Text Copyright c 2005 by Izobov, Krupchik. ORDINARY

More information

Math 124B January 31, 2012

Math 124B January 31, 2012 Math 124B January 31, 212 Viktor Grigoryan 7 Inhomogeneous boundary vaue probems Having studied the theory of Fourier series, with which we successfuy soved boundary vaue probems for the homogeneous heat

More information

Bourgain s Theorem. Computational and Metric Geometry. Instructor: Yury Makarychev. d(s 1, s 2 ).

Bourgain s Theorem. Computational and Metric Geometry. Instructor: Yury Makarychev. d(s 1, s 2 ). Bourgain s Theorem Computationa and Metric Geometry Instructor: Yury Makarychev 1 Notation Given a metric space (X, d) and S X, the distance from x X to S equas d(x, S) = inf d(x, s). s S The distance

More information

On Non-Optimally Expanding Sets in Grassmann Graphs

On Non-Optimally Expanding Sets in Grassmann Graphs ectronic Cooquium on Computationa Compexity, Report No. 94 (07) On Non-Optimay xpanding Sets in Grassmann Graphs Irit Dinur Subhash Khot Guy Kinder Dor Minzer Mui Safra Abstract The paper investigates

More information

XSAT of linear CNF formulas

XSAT of linear CNF formulas XSAT of inear CN formuas Bernd R. Schuh Dr. Bernd Schuh, D-50968 Kön, Germany; bernd.schuh@netcoogne.de eywords: compexity, XSAT, exact inear formua, -reguarity, -uniformity, NPcompeteness Abstract. Open

More information

arxiv: v1 [math.pr] 6 Oct 2017

arxiv: v1 [math.pr] 6 Oct 2017 EQUICONTINUOUS FAMILIES OF MARKOV OPERATORS IN VIEW OF ASYMPTOTIC STABILITY SANDER C. HILLE, TOMASZ SZAREK, AND MARIA A. ZIEMLAŃSKA arxiv:1710.02352v1 [math.pr] 6 Oct 2017 Abstract. Reation between equicontinuity

More information

First-Order Corrections to Gutzwiller s Trace Formula for Systems with Discrete Symmetries

First-Order Corrections to Gutzwiller s Trace Formula for Systems with Discrete Symmetries c 26 Noninear Phenomena in Compex Systems First-Order Corrections to Gutzwier s Trace Formua for Systems with Discrete Symmetries Hoger Cartarius, Jörg Main, and Günter Wunner Institut für Theoretische

More information

u(x) s.t. px w x 0 Denote the solution to this problem by ˆx(p, x). In order to obtain ˆx we may simply solve the standard problem max x 0

u(x) s.t. px w x 0 Denote the solution to this problem by ˆx(p, x). In order to obtain ˆx we may simply solve the standard problem max x 0 Bocconi University PhD in Economics - Microeconomics I Prof M Messner Probem Set 4 - Soution Probem : If an individua has an endowment instead of a monetary income his weath depends on price eves In particuar,

More information

The arc is the only chainable continuum admitting a mean

The arc is the only chainable continuum admitting a mean The arc is the ony chainabe continuum admitting a mean Aejandro Ianes and Hugo Vianueva September 4, 26 Abstract Let X be a metric continuum. A mean on X is a continuous function : X X! X such that for

More information

Convergence Property of the Iri-Imai Algorithm for Some Smooth Convex Programming Problems

Convergence Property of the Iri-Imai Algorithm for Some Smooth Convex Programming Problems Convergence Property of the Iri-Imai Agorithm for Some Smooth Convex Programming Probems S. Zhang Communicated by Z.Q. Luo Assistant Professor, Department of Econometrics, University of Groningen, Groningen,

More information

Problem set 6 The Perron Frobenius theorem.

Problem set 6 The Perron Frobenius theorem. Probem set 6 The Perron Frobenius theorem. Math 22a4 Oct 2 204, Due Oct.28 In a future probem set I want to discuss some criteria which aow us to concude that that the ground state of a sef-adjoint operator

More information

A CLUSTERING LAW FOR SOME DISCRETE ORDER STATISTICS

A CLUSTERING LAW FOR SOME DISCRETE ORDER STATISTICS J App Prob 40, 226 241 (2003) Printed in Israe Appied Probabiity Trust 2003 A CLUSTERING LAW FOR SOME DISCRETE ORDER STATISTICS SUNDER SETHURAMAN, Iowa State University Abstract Let X 1,X 2,,X n be a sequence

More information

arxiv: v1 [math.fa] 23 Aug 2018

arxiv: v1 [math.fa] 23 Aug 2018 An Exact Upper Bound on the L p Lebesgue Constant and The -Rényi Entropy Power Inequaity for Integer Vaued Random Variabes arxiv:808.0773v [math.fa] 3 Aug 08 Peng Xu, Mokshay Madiman, James Mebourne Abstract

More information

Maejo International Journal of Science and Technology

Maejo International Journal of Science and Technology Fu Paper Maejo Internationa Journa of Science and Technoogy ISSN 1905-7873 Avaiabe onine at www.mijst.mju.ac.th A study on Lucas difference sequence spaces (, ) (, ) and Murat Karakas * and Ayse Metin

More information

Indirect Optimal Control of Dynamical Systems

Indirect Optimal Control of Dynamical Systems Computationa Mathematics and Mathematica Physics, Vo. 44, No. 3, 24, pp. 48 439. Transated from Zhurna Vychisite noi Matematiki i Matematicheskoi Fiziki, Vo. 44, No. 3, 24, pp. 444 466. Origina Russian

More information

Global Optimality Principles for Polynomial Optimization Problems over Box or Bivalent Constraints by Separable Polynomial Approximations

Global Optimality Principles for Polynomial Optimization Problems over Box or Bivalent Constraints by Separable Polynomial Approximations Goba Optimaity Principes for Poynomia Optimization Probems over Box or Bivaent Constraints by Separabe Poynomia Approximations V. Jeyakumar, G. Li and S. Srisatkunarajah Revised Version II: December 23,

More information

$, (2.1) n="# #. (2.2)

$, (2.1) n=# #. (2.2) Chapter. Eectrostatic II Notes: Most of the materia presented in this chapter is taken from Jackson, Chap.,, and 4, and Di Bartoo, Chap... Mathematica Considerations.. The Fourier series and the Fourier

More information

arxiv: v3 [math.ca] 8 Nov 2018

arxiv: v3 [math.ca] 8 Nov 2018 RESTRICTIONS OF HIGHER DERIVATIVES OF THE FOURIER TRANSFORM MICHAEL GOLDBERG AND DMITRIY STOLYAROV arxiv:1809.04159v3 [math.ca] 8 Nov 018 Abstract. We consider severa probems reated to the restriction

More information

Discriminant Analysis: A Unified Approach

Discriminant Analysis: A Unified Approach Discriminant Anaysis: A Unified Approach Peng Zhang & Jing Peng Tuane University Eectrica Engineering & Computer Science Department New Oreans, LA 708 {zhangp,jp}@eecs.tuane.edu Norbert Riede Tuane University

More information

MIXING AUTOMORPHISMS OF COMPACT GROUPS AND A THEOREM OF SCHLICKEWEI

MIXING AUTOMORPHISMS OF COMPACT GROUPS AND A THEOREM OF SCHLICKEWEI MIXING AUTOMORPHISMS OF COMPACT GROUPS AND A THEOREM OF SCHLICKEWEI KLAUS SCHMIDT AND TOM WARD Abstract. We prove that every mixing Z d -action by automorphisms of a compact, connected, abeian group is

More information

Lecture Note 3: Stationary Iterative Methods

Lecture Note 3: Stationary Iterative Methods MATH 5330: Computationa Methods of Linear Agebra Lecture Note 3: Stationary Iterative Methods Xianyi Zeng Department of Mathematica Sciences, UTEP Stationary Iterative Methods The Gaussian eimination (or

More information

4 1-D Boundary Value Problems Heat Equation

4 1-D Boundary Value Problems Heat Equation 4 -D Boundary Vaue Probems Heat Equation The main purpose of this chapter is to study boundary vaue probems for the heat equation on a finite rod a x b. u t (x, t = ku xx (x, t, a < x < b, t > u(x, = ϕ(x

More information

Cryptanalysis of PKP: A New Approach

Cryptanalysis of PKP: A New Approach Cryptanaysis of PKP: A New Approach Éiane Jaumes and Antoine Joux DCSSI 18, rue du Dr. Zamenhoff F-92131 Issy-es-Mx Cedex France eiane.jaumes@wanadoo.fr Antoine.Joux@ens.fr Abstract. Quite recenty, in

More information

Volume 13, MAIN ARTICLES

Volume 13, MAIN ARTICLES Voume 13, 2009 1 MAIN ARTICLES THE BASIC BVPs OF THE THEORY OF ELASTIC BINARY MIXTURES FOR A HALF-PLANE WITH CURVILINEAR CUTS Bitsadze L. I. Vekua Institute of Appied Mathematics of Iv. Javakhishvii Tbiisi

More information

Two view learning: SVM-2K, Theory and Practice

Two view learning: SVM-2K, Theory and Practice Two view earning: SVM-2K, Theory and Practice Jason D.R. Farquhar jdrf99r@ecs.soton.ac.uk Hongying Meng hongying@cs.york.ac.uk David R. Hardoon drh@ecs.soton.ac.uk John Shawe-Tayor jst@ecs.soton.ac.uk

More information

Lecture Notes 4: Fourier Series and PDE s

Lecture Notes 4: Fourier Series and PDE s Lecture Notes 4: Fourier Series and PDE s 1. Periodic Functions A function fx defined on R is caed a periodic function if there exists a number T > such that fx + T = fx, x R. 1.1 The smaest number T for

More information

Reproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto

Reproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Distributed average consensus: Beyond the realm of linearity

Distributed average consensus: Beyond the realm of linearity Distributed average consensus: Beyond the ream of inearity Usman A. Khan, Soummya Kar, and José M. F. Moura Department of Eectrica and Computer Engineering Carnegie Meon University 5 Forbes Ave, Pittsburgh,

More information

Stochastic Variational Inference with Gradient Linearization

Stochastic Variational Inference with Gradient Linearization Stochastic Variationa Inference with Gradient Linearization Suppementa Materia Tobias Pötz * Anne S Wannenwetsch Stefan Roth Department of Computer Science, TU Darmstadt Preface In this suppementa materia,

More information

Selmer groups and Euler systems

Selmer groups and Euler systems Semer groups and Euer systems S. M.-C. 21 February 2018 1 Introduction Semer groups are a construction in Gaois cohomoogy that are cosey reated to many objects of arithmetic importance, such as cass groups

More information

On the evaluation of saving-consumption plans

On the evaluation of saving-consumption plans On the evauation of saving-consumption pans Steven Vanduffe Jan Dhaene Marc Goovaerts Juy 13, 2004 Abstract Knowedge of the distribution function of the stochasticay compounded vaue of a series of future

More information

B. Brown, M. Griebel, F.Y. Kuo and I.H. Sloan

B. Brown, M. Griebel, F.Y. Kuo and I.H. Sloan Wegeerstraße 6 53115 Bonn Germany phone +49 8 73-347 fax +49 8 73-757 www.ins.uni-bonn.de B. Brown, M. Griebe, F.Y. Kuo and I.H. Soan On the expected uniform error of geometric Brownian motion approximated

More information

Learning, Regularization and Ill-Posed Inverse Problems

Learning, Regularization and Ill-Posed Inverse Problems Learning, Regularization and Ill-Posed Inverse Problems Lorenzo Rosasco DISI, Università di Genova rosasco@disi.unige.it Andrea Caponnetto DISI, Università di Genova caponnetto@disi.unige.it Ernesto De

More information

FRIEZE GROUPS IN R 2

FRIEZE GROUPS IN R 2 FRIEZE GROUPS IN R 2 MAXWELL STOLARSKI Abstract. Focusing on the Eucidean pane under the Pythagorean Metric, our goa is to cassify the frieze groups, discrete subgroups of the set of isometries of the

More information

A Comparison Study of the Test for Right Censored and Grouped Data

A Comparison Study of the Test for Right Censored and Grouped Data Communications for Statistica Appications and Methods 2015, Vo. 22, No. 4, 313 320 DOI: http://dx.doi.org/10.5351/csam.2015.22.4.313 Print ISSN 2287-7843 / Onine ISSN 2383-4757 A Comparison Study of the

More information

OPERATORS WITH COMMON HYPERCYCLIC SUBSPACES

OPERATORS WITH COMMON HYPERCYCLIC SUBSPACES OPERATORS WITH COMMON HYPERCYCLIC SUBSPACES R. ARON, J. BÈS, F. LEÓN AND A. PERIS Abstract. We provide a reasonabe sufficient condition for a famiy of operators to have a common hypercycic subspace. We

More information

Asymptotic Properties of a Generalized Cross Entropy Optimization Algorithm

Asymptotic Properties of a Generalized Cross Entropy Optimization Algorithm 1 Asymptotic Properties of a Generaized Cross Entropy Optimization Agorithm Zijun Wu, Michae Koonko, Institute for Appied Stochastics and Operations Research, Caustha Technica University Abstract The discrete

More information

On the estimation of multiple random integrals and U-statistics

On the estimation of multiple random integrals and U-statistics Péter Major On the estimation of mutipe random integras and U-statistics Lecture Note January 9, 2014 Springer Contents 1 Introduction................................................... 1 2 Motivation

More information

An explicit resolution of the equity-efficiency tradeoff in the random allocation of an indivisible good

An explicit resolution of the equity-efficiency tradeoff in the random allocation of an indivisible good An expicit resoution of the equity-efficiency tradeoff in the random aocation of an indivisibe good Stergios Athanassogou, Gauthier de Maere d Aertrycke January 2015 Abstract Suppose we wish to randomy

More information

Bayesian Learning. You hear a which which could equally be Thanks or Tanks, which would you go with?

Bayesian Learning. You hear a which which could equally be Thanks or Tanks, which would you go with? Bayesian Learning A powerfu and growing approach in machine earning We use it in our own decision making a the time You hear a which which coud equay be Thanks or Tanks, which woud you go with? Combine

More information

CHAPTER 2 AN INTRODUCTION TO WAVELET ANALYSIS

CHAPTER 2 AN INTRODUCTION TO WAVELET ANALYSIS CHAPTER 2 AN INTRODUCTION TO WAVELET ANALYSIS [This chapter is based on the ectures of Professor D.V. Pai, Department of Mathematics, Indian Institute of Technoogy Bombay, Powai, Mumbai - 400 076, India.]

More information

THIELE CENTRE. On spectral distribution of high dimensional covariation matrices. Claudio Heinrich and Mark Podolskij

THIELE CENTRE. On spectral distribution of high dimensional covariation matrices. Claudio Heinrich and Mark Podolskij THIELE CENTRE for appied mathematics in natura science On spectra distribution of high dimensiona covariation matrices Caudio Heinrich and Mark Podoskij Research Report No. 02 December 2014 On spectra

More information

SUPPLEMENTARY MATERIAL TO INNOVATED SCALABLE EFFICIENT ESTIMATION IN ULTRA-LARGE GAUSSIAN GRAPHICAL MODELS

SUPPLEMENTARY MATERIAL TO INNOVATED SCALABLE EFFICIENT ESTIMATION IN ULTRA-LARGE GAUSSIAN GRAPHICAL MODELS ISEE 1 SUPPLEMENTARY MATERIAL TO INNOVATED SCALABLE EFFICIENT ESTIMATION IN ULTRA-LARGE GAUSSIAN GRAPHICAL MODELS By Yingying Fan and Jinchi Lv University of Southern Caifornia This Suppementary Materia

More information

STA 216 Project: Spline Approach to Discrete Survival Analysis

STA 216 Project: Spline Approach to Discrete Survival Analysis : Spine Approach to Discrete Surviva Anaysis November 4, 005 1 Introduction Athough continuous surviva anaysis differs much from the discrete surviva anaysis, there is certain ink between the two modeing

More information

Moreau-Yosida Regularization for Grouped Tree Structure Learning

Moreau-Yosida Regularization for Grouped Tree Structure Learning Moreau-Yosida Reguarization for Grouped Tree Structure Learning Jun Liu Computer Science and Engineering Arizona State University J.Liu@asu.edu Jieping Ye Computer Science and Engineering Arizona State

More information

QUANTITATIVE ANALYSIS OF FINITE-DIFFERENCE APPROXIMATIONS OF FREE-DISCONTINUITY PROBLEMS

QUANTITATIVE ANALYSIS OF FINITE-DIFFERENCE APPROXIMATIONS OF FREE-DISCONTINUITY PROBLEMS QUANTITATIVE ANALYSIS OF FINITE-DIFFERENCE APPROXIMATIONS OF FREE-DISCONTINUITY PROBLEMS ANNIKA BACH, ANDREA BRAIDES, AND CATERINA IDA ZEPPIERI Abstract. Motivated by appications to image reconstruction,

More information

ADELIC ANALYSIS AND FUNCTIONAL ANALYSIS ON THE FINITE ADELE RING. Ilwoo Cho

ADELIC ANALYSIS AND FUNCTIONAL ANALYSIS ON THE FINITE ADELE RING. Ilwoo Cho Opuscua Math. 38, no. 2 208, 39 85 https://doi.org/0.7494/opmath.208.38.2.39 Opuscua Mathematica ADELIC ANALYSIS AND FUNCTIONAL ANALYSIS ON THE FINITE ADELE RING Iwoo Cho Communicated by.a. Cojuhari Abstract.

More information

Expectation-Maximization for Estimating Parameters for a Mixture of Poissons

Expectation-Maximization for Estimating Parameters for a Mixture of Poissons Expectation-Maximization for Estimating Parameters for a Mixture of Poissons Brandon Maone Department of Computer Science University of Hesini February 18, 2014 Abstract This document derives, in excrutiating

More information