SVM Soft Margin Classifiers: Linear Programming versus Quadratic Programming

Size: px
Start display at page:

Download "SVM Soft Margin Classifiers: Linear Programming versus Quadratic Programming"

Transcription

1 SVM Soft Margin Classifiers: Linear Prograing versus Quadratic Prograing Qiang Wu Ding-Xuan Zhou Departent of Matheatics, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China Support vector achine soft argin classifiers are iportant learning algoriths for classification probles. They can be stated as convex optiization probles and are suitable for a large data setting. Linear prograing SVM classifier is specially efficient for very large size saples. But little is known about its convergence, copared with the well understood quadratic prograing SVM classifier. In this paper, we point out the difficulty and provide an error analysis. Our analysis shows that the convergence behavior of the linear prograing SVM is alost the sae as that of the quadratic prograing SVM. This is ipleented by setting a stepping stone between the linear prograing SVM and the classical 1 nor soft argin classifier. An upper bound for the isclassification error is presented for general probability distributions. Explicit learning rates are derived for deterinistic and weakly separable distributions, and for distributions satisfying soe Tsybakov noise condition. 1

2 1 Introduction Support vector achines SVM s) for an iportant subject in learning theory. They are very efficient for any applications, especially for classification probles. The classical SVM odel, the so-called 1 nor soft argin SVM, was introduced with polynoial kernels by Boser et al. 1992) and with general kernels by Cortes and Vapnik 1995). Since then any different fors of SVM algoriths were introduced for different purposes e.g. Niyogi and Girosi 1996; Vapnik 1998). Aong the the linear prograing LP) SVM Bradley and Mangasarian 2000; Kecan and Hadzic 2000; Niyogi and Girosi 1996; Pedroso and N. Murata 2001; Vapnik 1998) is an iportant one because of its linearity and flexibility for large data setting. The ter linear prograing eans the algorith is based on linear prograing optiization. Correspondingly, the 1 nor soft argin SVM is also called quadratic prograing QP) SVM since it is based on quadratic prograing optiization Vapnik 1998). Many experients deonstrate that LP-SVM is efficient and perfors even better than QP-SVM for soe purposes: capable of solving huge saple size probles Bradley and Mangasarian 2000), iproving the coputational speed Pedroso and N. Murata 2001), and reducing the nuber of support vectors Kecan and Hadzic 2000). While the convergence of QP-SVM has becoe pretty well understood because of recent works Steinwart 2002; Zhang 2004; Wu and Zhou 2003; Scovel and Steinwart 2003; Wu et al. 2004), little is known for LP-SVM. The purpose of this paper is to point out the ain difficulty and then provide error analysis for LP-SVM. Consider the binary classification setting. Let X, d) be a copact etric space and Y = 1, 1}. A binary classifier is a function f : X Y which labels every point x X with soe y Y. Both LP-SVM and QP-SVM considered here are kernel based classifiers. A function K : X X R is called a Mercer kernel if it is continuous, syetric and positive seidefinite, i.e., for any finite set of distinct points 2

3 x 1,, x l } X, the atrix Kx i, x j )) l i,j=1 is positive seidefinite. Let z = x 1, y 1 ),, x, y )} X Y ) be the saple. Motivated by reducing the nuber of support vectors of the 1 nor soft argin SVM, Vapnik 1998) introduced the LP-SVM algorith associated to a Mercer Kernel K. It is based on the following linear prograing optiization proble: in α R +, b R subject to 1 ξ i + 1 } α i C ) y i α j y j Kx i, x j ) + b 1 ξ i j=1 ξ i 0, i = 1,,. 1.1) Here α = α 1,, α ), ξ i s are slack variables. The trade-off paraeter C = C) > 0 depends on and is crucial. If ) α z = α 1,z,, α,z ), b z solves the optiization proble 1.1), the LP-SVM classifier is given by sgnf z ) with f z x) = α i,z y i Kx, x i ) + b z. 1.2) For a real-valued function f : X R, its sign function is defined as sgnf)x) = 1 if fx) 0 and sgnf)x) = 1 otherwise. The QP-SVM is based on a quadratic prograing optiization proble: in α R +, b R subject to 1 ξ i + 1 } 2 C α i y i Kx i, x j )α j y j i,j=1 ) y i α j y j Kx i, x j ) + b 1 ξ i j=1 ξ i 0, i = 1,,. 1.3) Here C = C) > 0 is also a trade-off paraeter depending on the saple size. If α z = α 1,z,, α,z ), b z ) solves the optiization proble 1.3), then the 1 nor soft argin classifier is defined by sgn f z ) with f z x) = α i,z y i Kx, x i ) + b z. 1.4) 3

4 Observe that both LP-SVM classifier 1.1) and QP-SVM classifier 1.3) are ipleented by convex optiization probles. Copared with this, neural network learning algoriths are often perfored by nonconvex optiization probles. The reproducing kernel property of Mercer kernels ensures nice approxiation power of SVM classifiers. Recall that the Reproducing Kernel Hilbert Space RKHS) H K associated with a Mercer kernel K is defined Aronszajn 1950) to be the closure of the linear span of the set of functions K x := Kx, ) : x X} with the inner product <, > K satisfying < K x, K y > K = Kx, y). The reproducing property is given by < f, K x > K = fx), x X, f H K. 1.5) The QP-SVM is well understood. It has attractive approxiation properties see 2.2) below) because the learning schee can be represented as a Tikhonov regularization Evgeniou et al. 2000) odified by an offset) associated with the RKHS: f z = arg in f=f +b H K +R 1 1 yi fx i ) ) C f 2 K }, 1.6) where t) + = ax 0, t}. Set H K := H K + R. For a function f = f 1 + b 1 H K, we denote f = f 1 and b f = b 1. Write b fz as b z. It turns out that 1.6) is the sae as 1.3) together with 1.4). To see this, we first note that f z ust lies in the span of K xi } according to the representation theore Wahba 1990). Next, the dual proble of 1.6) shows Vapnik 1998) that the coefficient of K xi, α i y i, has the sae sign as y i. Finally, the definition of the H K nor yields f z 2 K = α iy i K xi 2 K = i,j=1 α iy i Kx i, x j )α j y j. The rich knowledge on Tikhonov regularization schees and the idea of bias-variance trade-off developed in the neural network literature provide a atheatical foundation of the QP-SVM. In particular, the convergence is well understood due to the work done within the last a few years. Here the for 1.6) illustrate soe advantages of the QP-SVM: the iniization is 4

5 taken over the whole space H K, so we expect the QP-SVM has soe good approxiation power, siilar to the approxiation error of the space H K. Things are totally different for LP-SVM. Set } H K,z = α i y i Kx, x i ) : α = α 1,, α ) R +. Then the LP-SVM schee 1.1) can be written as f z = arg in f=f +b H K,z +R 1 1 yi fx i ) ) + 1 )} + C Ωf. 1.7) Here we have denoted Ωf ) = yα l 1 = α i for f = α iy i K xi with α i 0. It plays the role of a nor of f in soe sense. This is not a Hilbert space nor, which raises the technical difficulty for the atheatical analysis. More seriously, the hypothesis space H K,z depends on the saple z. The centers x i of the basis functions in H K,z are deterined by the saple z, not free. One ight consider regularization schees in the space of all linear cobinations with free centers, but whether the iniization can be reduced into a convex optiization proble of size, like 1.1), is unknown. Also, it is difficult to relate the corresponding optiu in a ball with radius C) to fz with respect to the estiation error. Thus separating the error for LP-SVM into two ters of saple error and approxiation error is not as iediate as for the QP-SVM or neural network ethods Niyogi and Girosi 1996) where the centers are free. In this paper, we shall overcoe this difficulty by setting a stepping stone. Turn to the error analysis. Let ρ be a Borel probability easure on Z := X Y and X, Y) be the corresponding rando variable. The prediction power of a classifier f is easured by its isclassification error, i.e., the probability of the event fx ) Y: Rf) = ProbfX ) Y} = P Y = fx) x) dρ X. 1.8) Here ρ X is the arginal distribution and ρ x) is the conditional distribution of ρ. The classifier iniizing the isclassification error is called the Bayes 5 X

6 rule f c. It takes the for 1, if P Y = 1 x) P Y = 1 x), f c x) = 1, if P Y = 1 x) < P Y = 1 x). If we define the regression function of ρ as f ρ x) = ydρy x) = P Y = 1 x) P Y = 1 x), x X, Y then f c = sgnf ρ ). Note that for a real-valued function f, sgnf) gives a classifier and its isclassification error will be denoted by Rf) for abbreviation. Though the Bayes rule exists, it can not be found directly since ρ is unknown. Instead, we have in hand a set of saples z = z i } = x 1, y 1 ),, x, y )} N). Throughout the paper we assue z 1,, z } are independently and identically distributed according to ρ. A classification algorith constructs a classifier f z based on z. Our goal is to understand how to choose the paraeter C = C) in the algorith 1.1) so that the LP-SVM classifier sgnf z ) can approxiate the Bayes rule f c with satisfactory convergence rates as ). Our approach provides clues to study learning algoriths with penalty functional different fro the RKHS nor Niyogi and Girosi 1996; Evgeniou et al. 2000). It can be extended to schees with general loss functions Rosasco et al. 2004; Lugosi and Vayatis 2004; Wu et al. 2004). 2 Main Results In this paper we investigate learning rates, the decay of the excess isclassification error Rf z ) Rf c ) as and C) becoe large. Consider the QP-SVM classification algorith f z defined by 1.3). Steinwart 2002) showed that R f z ) Rf c ) 0 as and C = C) ), when H K is dense in CX), the space of continuous functions on X with the nor. Lugosi and Vayatis 2004) found that for the exponential loss, the excess isclassification error of regularized boosting algoriths 6

7 can be estiated by the excess generalization error. An iportant result on the relation between the isclassification error and generalization error for a convex loss function is due to Zhang 2004). See Bartlett et al. 2003), and Chen et al. 2004) for extensions to general loss functions. Here we consider the hinge loss V y, fx)) = 1 yfx)) +. The generalization error is defined as Ef) = V y, fx)) dρ. Z Note that f c is a iniizer of Ef). Then Zhang s results asserts that Rf) Rf c ) Ef) Ef c ), f : X R. 2.1) Thus, the excess isclassification error R f z ) Rf c ) can be bounded by the excess generalization error E f z ) Ef c ), and the following error decoposition Wu and Zhou 2003) holds: E f z ) Ef c ) E ) ) fz Ez fz + Ez fk, C) E fk, C) } + D C). 2.2) Here E z f) = 1 V y i, fx i )). The function f K, C depends on C and is defined as f K,C := arg in Ef) + 1 } 2C f 2 K, C > ) f H K The decoposition 2.2) akes the error analysis for QP-SVM easy, siilar to that in Niyogi and Girosi 1996). The second ter of 2.2) easures the approxiation power of H K for ρ. by Definition 2.1. The regularization error of the syste K, ρ) is defined DC) := inf Ef) Ef c ) + 1 } f H K 2C f 2 K. 2.4) The regularization error for a regularizing function f K,C H K is defined as DC) := Ef K,C ) Ef c ) + 1 2C f K,C 2 K. 2.5) 7

8 In Wu and Zhou 2003) we showed that Ef) Ef c ) f f c L 1 ρx. Hence the regularization error can be estiated by the approxiation in a weighted L 1 space, as done in Sale and Zhou 2003), and Chen et al. 2004). Definition 2.2. We say that the probability easure ρ can be approxiated by H K with exponent 0 < β 1 if there exists a constant c β such that H1) DC) cβ C β, C > 0. The first ter of 2.2) is called the saple error. It has been well understood in learning theory by concentration inequalities, e.g. Vapnik 1998), Devroye et al. 1997), Niyogi 1998), Cucker and Sale 2001), Bousquet and Elisseeff 2002). The approaches developed in Barron 1990), Bartlett 1998), Niyogi and Girosi 1996), and Zhang 2004) separate the regularization error and the saple error concerning f z. In particular, for the QP-SVM, Zhang 2004) proved that E z Z E fz ) } inf Ef) + 1 } f H K 2 C f 2 K + 2 C. 2.6) It follows that E z Z E fz ) Ef c ) } 2 C D C)+. When H1) holds, Zhang s bound in connection with 2.1) yields E z Z R fz ) Rf c ) } = O C β )+ 2 C. This is siilar to soe well-known bounds for the neural network learning algoriths, see e.g. Theore 3.1 in Niyogi and Girosi 1996). The best learning rate derived fro 2.6) by choosing C = 1/β+1) is E z Z R fz ) Rf c ) } = O α), α = β β ) Observe that the saple error bound 2 C in 2.6) is independent of the kernel K or the distribution ρ. If soe inforation about K or ρ is available, the saple error and hence the excess isclassification error can be iproved. The inforation we need about K is the capacity easured by covering nubers. 8

9 Definition 2.3. Let F be a subset of a etric space. For any ε > 0, the covering nuber N F, ε) is defined to be the inial integer l N such that there exist l balls with radius ε covering F. In this paper we only use the unifor covering nuber. Covering nubers easured by epirical distances are also used in the literature van der Vaart and Wellner 1996). For coparisons, see Pontil 2003). Let B R = f H K : f K R}. It is a subset of CX) and the covering nuber is well defined. We denote the covering nuber of the unit ball B 1 as N ε) := N B 1, ε ), ε > ) Definition 2.4. The RKHS H K is said to have logarithic coplexity exponent s 1 if there exists a constant c s > 0 such that H2) log N ε) c s log1/ε) ) s. It has polynoial coplexity exponent s > 0 if there is soe c s > 0 such that H2 ) log N ε) c s 1/ε ) s. The unifor covering nuber has been extensively studied in learning theory. In particular, we know that for the Gaussian kernel Kx, y) = exp x y 2 /σ 2 } with σ > 0 on a bounded subset X of R n, H2) holds with s = n+1, see Zhou 2002); if K is C r with r > 0 Sobolev soothness), then H2 ) is valid with s = 2n/r, see Zhou 2003). The inforation we need about ρ is a Tsybakov noise condition Tsybakov 2004). Definition 2.5. Let 0 q. We say that ρ has Tsybakov noise exponent q if there exists a constant c q > 0 such that H3) P X x X : fρ x) c q t} ) t q. All distributions have at least noise exponent 0. Deterinistic distributions which satisfy f ρ x) 1) have the noise exponent q = with c = 1. 9

10 Using the above conditions about K and ρ, Scovel and Steinwart 2003) showed that when H1), H2 ) and H3) hold, for every ɛ > 0 and every δ > 0, with confidence 1 δ, R f z ) Rf c ) = O α), α = 4βq + 1) ɛ. 2.9) 2q + sq + 4)1 + β) When no conditions are assued for the distribution i.e., q = 0) or s = 2 for the kernel the worse case when epirical covering nubers are used, see van der Vaart and Wellner 1996), the rate is reduced to α = β ɛ, arbitrarily β+1 close to Zhang s rate 2.7). Recently, Wu et al. 2004) iprove the rate 2.9) and show that under the sae assuptions H1), H2 ) and H3), for every ɛ, δ > 0, with confidence 1 δ, R f z ) Rf c ) = O α), α = in βq + 1) βq + 2) + q + 1 β)s/2 ɛ, 2β }. β ) When soe condition is assued for the kernel but not for the distribution, i.e., s < 2 but q = 0, the rate 2.10) has power α = in β 2β ɛ, 2β+1 β)s/2 β+1}. This is better than 2.7) or 2.9) or the rates given in Bartlett et al. 2003; Blanchard et al. 2004, see Chen et al. 2004; Wu et al for detailed coparisons) if β < 1. This iproveent is possible due to the projection operator. Definition 2.6. The projection operator π is defined on the space of easurable functions f : X R as 1, if fx) > 1, πf)x) = 1, if fx) < 1, fx), if 1 fx) 1. The idea of projections appeared in argin-based bound analysis, e.g. Bartlett 1998), Lugosi and Vayatis 2004), Zhang 2002), Anthony and Bartlett 1999). We used the projection operator for the purpose of bounding isclassification and generalization errors in Chen et al. 2004). It helps 10

11 us to get sharper bounds of the saple error: probability inequalities are applied to rando variables involving functions π f z ) bounded by 1), not to f z the corresponding bound increases to infinity as C becoes large). In this paper we apply the projection operator to the LP-SVM. Turn to our ain goal, the LP-SVM classification algorith f z defined by 1.1). To our knowledge, the convergence of the algorith has not been verified, even for distributions strictly separable by a universal kernel. What is the ain difficulty in the error analysis? One difficulty lies in the error decoposition: nothing like 2.2) exists for LP-SVM in the literature. Bounds for the regularization or approxiation error independent of z are not available. We do not know whether it can be bounded by a nor in the whole space H K or a nor siilar to those in Niyogi and Girosi 1996). In the paper we overcoe the difficulty by eans of a stepping stone fro QP-SVM to LP-SVM. Then we can provide error analysis for general distributions. In particular, explicit learning rates will be presented. To this end, we first ake an error decoposition. Theore 1. Let C > 0, 0 < η 1 and f K,C H K. There holds Rf z ) Rf c ) 2ηRf c ) + S, C, η) + 2DηC), where S, C, η) is the saple error defined by } ) ) } S, C, η) := Eπf z )) E z πf z )) +1+η) E z fk,c E fk,c. 2.11) Theore 1 will be proved in Section 4. The ter DηC) is the regularization error Sale and Zhou 2004) defined for a regularizing function f K,C arbitrarily chosen) by 2.5). In Chen et al. 2004), we showed that where DC) DC) κ2 2C 2.12) κ := E 0 /1 + κ), κ = sup Kx, x), x X 11 E0 := inf Eb) Efc ) }. b R

12 Also, κ = 0 only for very special distributions. Hence the decay of DC) cannot be faster than O1/C) in general. Thus, to have satisfactory convergence rates, C can not be too sall, and it usually takes the for of τ for soe τ > 0. The constant κ is the nor of the inclusion H K CX): f κ f K, f H K. 2.13) Next we focus on analyzing the learning rates. Since a unifor rate is ipossible for all probability distributions as shown in Theore 7.2 of Devroye et al. 1997), we need to consider subclasses. The choice of η is iportant in the upper bound in Theore 1. If the distribution is deterinistic, i.e., Rf c ) = 0, we ay choose η = 1. When Rf c ) > 0, we ust choose η = η) 0 as in order to get the convergence rate. Of course the latter choice ay lead to a slightly worse rate. Thus, we will consider these two cases separately. The following proposition gives the bound for deterinistic distributions. Proposition 2.1. Suppose Rf c ) = 0. If f K,C is a function in H K satisfying V y, f K,C x)) M, then for every 0 < δ < 1, with confidence 1 δ there holds Rf z ) 32ε,C + 20M log2/δ) 3 + 8DC), where with a constant c s depending on c s, κ and s, ε,c is given by 22 log 2 [ δ + c s log CM log 2 ) )] } s + log CDC), if H2) holds; δ 35 log 2/δ ) 1 + c s) ) 1 s 1+s CM) 1+s + 32c s CDC)) s 1+s, if H2 ) holds. 3 1/1+s) Proposition 2.1 will be proved in Section 6. As corollaries we obtain learning rates for strictly separable distributions and for weakly separable distributions. Definition 2.7. We say that ρ is strictly separable by H K with argin γ > 0 if there is soe function f γ H K such that f γ K = 1 and yf γ x) γ alost everywhere. 12

13 For QP-SVM, the strictly separable case is well understood, see e.g. Vapnik 1998), Cristianini and Shawe-Taylor 2000) and vast references therein. For LP-SVM, we have Corollary 2.1. If ρ is strictly separable by H K with argin γ > 0 and H2) holds, then Rf z ) 704 log 2 δ + c s log + log 1 ) } s + 4 γ 2 Cγ. 2 ) In particular, this will yield the learning rate O log ) s by taking C = /γ 2. Proof. Take f K,C = f γ /γ. Then V y, f K,C x)) 0 and DC) equals 1 f 2C γ /γ 2 K = 1. The conclusion follows fro Proposition 2.1 by choosing 2Cγ 2 M = 0. Reark 2.1. For strictly separable distributions, we verify the optial rate when H2) holds. Siilar rates are true for ore general kernels. But we oit details here. Definition 2.8. We say that ρ is weakly) separable by H K if there is soe function fsp H K, called the separating function, such that f sp K = 1 and yfspx) > 0 alost everywhere. It has separating exponent θ 0, ] if for soe γ θ > 0, there holds ρ X 0 < f sp x) < γ θ t) t θ. 2.14) Corollary 2.2. Suppose that ρ is separable by H K with 2.14) valid. i) If H2) holds, then log + log C) s Rf z ) = O ) + C θ θ+2. log )s This gives the learning rate O ) by taking C = θ+2)/θ. 13

14 ii) If H2 ) holds, then Rf z ) = O C s 1+s 2s θ+2 C + ) ) 1 1+s + C θ θ+2. This yields the learning rate O θ θ+2 sθ+2s+θ ) by taking C = sθ+2s+θ. Proof. Take f K,C = C 1 θ+2 f sp/γ θ. By the definition of fsp, we have yf K,C x) 0 alost everywhere. Hence 0 V y, f K,C x)) 1. Moreover, Ef K,C ) = X 1 C 1 θ+2 γ θ ) } fspx) dρ X = ρ X 0 < fspx) < γ θ C 1 θ+2 + which is bounded by C θ θ+2. Therefore, DC) )C θ 2γθ 2 θ+2. Then the conclusion follows fro Proposition 2.1 by choosing M = 1. Exaple. Let X = [ 1/2, 1/2] and ρ be the Borel probability easure on Z such that ρ X is the Lebesgue easure on X and 1, if 1/2 x < 0, f ρ x) = 1, if 0 < x < 1/2. If we take the linear kernel Kx, y) = x y, then θ = 1, γ θ = 1/2. Since H2) is satisfied with s = 1, the learning rate is O log ) by taking C = 3. Reark 2.2. The condition 2.14) with θ = is exactly the definition of strictly separable distribution and γ θ is the argin. The choice of f K,C and the regularization error play essential roles to get our error bounds. It influences the strategy of choosing the regularization paraeter odel selection) and deterines learning rates. For weakly separable distributions we chose f K,C to be ultiples of a separating function in Corollary 2.2. For the general case, it can be the choice 2.3). Let s analyze learning rates for distributions having polynoially decaying regularization error, i.e., H1) with β 1. This is reasonable because of 2.12). 14

15 Theore 2. Suppose that Rf c ) = 0 and the hypotheses H1), H2 ) hold with 0 < s < and 0 < β 1, respectively. Take C = ζ with ζ := in 1, 2 }. Then for every 0 < δ < 1 there exists a constant c s+β 1+β depending on s, β, δ such that with confidence 1 δ, } 2β Rf z ) c α, α = in 1 + β, β. s + β Next we consider general distributions satisfying Tsybakov condition Tsybakov 2004). Theore 3. Assue the hypotheses H1), H2 ) and H3) with 0 < s <, 0 < β 1, and 0 q. Take C = ζ with } 2 ζ := in β + 1, q + 1)β + 1). sq + 1) + βq qs + s) For every ɛ > 0 and every 0 < δ < 1 there exists a constant c depending on s, q, β, δ, and ɛ such that with confidence 1 δ, } 2β Rf z ) Rf c ) c α, α = in β + 1, βq + 1) sq + 1) + βq qs + s) ɛ. Reark 2.3. Since Rf c ) is usually sall for a eaningful classification proble, the upper bound in Theore 1 tells that the perforance of LP- SVM is siilar to that of QP-SVM. However, to have convergence rates, we need to choose η = η) 0 as becoes large. This akes our rate worse than that of QP-SVM. This is the case when the capacity index s is large. When s is very sall, the rate is O α ) with α close to in q+1, 2β }, q+2 β+1 which coincides to the rate 2.10), and is better than the rates 2.7) or 2.9) for QP-SVM. As any C kernel satisfies H2 ) for an arbitrarily sall s > 0 Zhou 2003), this is the case for polynoial or Gaussian kernels, usually used in practice. Reark 2.4. Here we use a stepping stone fro QP-SVM to LP-SVM. So the derived learning rates for the LP-SVM are essentially no worse than those of QP-SVM. It would be interesting to introduce different tools to get 15

16 learning rates for the LP-SVM, better than those of QP-SVM. Also, the choice of the trade-off paraeter C in Theore 3 depends on the indices β approxiation), s capacity), and q noise condition). This gives a rate which is optial by our approach. One can take other choices ζ > 0 for C = ζ ), independent of β, s, q, and then derive learning rates according to the proof of Theore 3. But the derived rates are worse than the one stated in Theore 3. It would be of iportance to give soe ethods for choosing C adaptively. Reark 2.5. When epirical covering nubers are used, the capacity index can be restricted to s [0, 2]. Siilar learning rates can be derived, as done in Blanchard et al. 2004), Wu et al. 2004). 3 Stepping Stone Recall that in 1.7), the penalty ter Ωf ) is usually not a nor. This akes the schee difficult to analyze. Since the solution f z of the LP-SVM has a representation siilar to f z in QP-SVM, we expect close relations between these schees. Hence the latter ay play roles in the analysis for the forer. To this end, we need to estiate Ω f z ), the l 1 nor of the coefficients of the solution f z to 1.4). Lea 3.1. For every C > 0, the function f z defined by 1.3) and 1.4) satisfies Ω f z ) = α i,z CE z f z ) + f z 2 K. Proof. The dual proble of the 1 nor soft argin SVM Vapnik 1998) tells us that the coefficients α i,z in the expression 1.4) of f z satisfy 0 α i,z C and α i,z y i = ) The definition of the loss function V iplies that 1 y i fz x i ) V y i, f z x i ) ). 16

17 Then α i,z α i,z y i fz x i ) α i,z V y i, f z x i ) ). Applying the upper bound for α i,z in 3.1), we can bound the right side above as α i,z V y i, f z x i ) ) C V y i, f z x i ) ) = CE ) z fz. Applying the second relation in 3.1) yields It follows that α i,z y i bz = 0. α i,z y i fz x i ) = α i,z y i f z x i ) + b ) z = α i,z y i f z x i ). But f z x i ) = α j,z y j Kx i, x j ). We have j=1 α i,z y i fz x i ) = α i,z y i α j,z y j Kx i, x j ) = f z 2 K. i,j=1 Hence the bound for Ω f z ) follows. Reark 3.1. Dr.Yiing Ying pointed out to us that actually the equality holds in Lea 3.1. This follows fro the KKT conditions. But we only need the inequality here. 4 Error Decoposition In this section, we estiate Rf z ) Rf c ). 17

18 Since sgnπf)) = sgnf), we have Rf) = Rπf)). Using 2.1) to πf), we obtain Rf) Rf c ) = Rπf)) Rf c ) Eπf)) Ef c ). 4.1) It is easy to see that V y, πf)x)) V y, fx)). Hence Eπf)) Ef) and E z πf)) E z f). 4.2) We are in a position to prove Theore 1 which, by 4.1), is an easy consequence of the following result. Proposition 4.1. Let C > 0, 0 < η 1 and f K,C H K. Then Eπf z )) Ef c ) + 1 C Ωf z ) 2ηRf c ) + S, C, η) + 2DηC), where S, C, η) is defined by 2.11). Proof. Take f z to be the solution of 1.4) with C = ηc. We see fro the definition of f z and 4.2) that E z πfz ) ) + 1 ) C Ωf z ) E z f z ) + 1 C Ω f ) z ) 0. This enables us to decopose Eπf z )) + 1 C Ωf z ) as Eπf z )) + 1 C Ωf z ) E πf z ) ) E z πfz ) )} + E z f z ) + 1 C Ω f ) z ). Lea 3.1 gives Ω f z ) CE z f z ) + f z 2 K. But C = ηc. Hence Eπf z )) + 1 C Ωf z ) E πf z ) ) E z πfz ) )} η)e z f z ) + 1 C f z 2 K. Next we use the function f K,C to analyze the second ter of the above bound and get E z f z ) η)c f z 2 K E z f z ) C f z 2 K E z f K,C ) C f K,C 2 K. This bound can be written as E z f K,C ) Ef K,C ) } + Ef K,C )+ 1 2 C f K,C 2 K 18 }.

19 Cobining the above two steps, we find that Eπf z )) Ef c ) + 1 C Ωf z ) is bounded by E πf z ) ) E z πfz ) )} ) ) η) E } z fk,c E fk,c +1 + η) E ) ) } f K,C E fc + 1 f 2ηC K,C 2 K + ηef c ). By the fact Ef c ) = 2Rf c ) and the definition of DC), we draw our conclusion. 5 Probability Inequalities In this section we give soe probability inequalities. They odify the Bernstein inequality and extend our previous work in Chen et al. 2004) which was otivated by saple error estiates for the square loss e.g. Barron 1990; Bartlett 1998; Cucker and Sale 2001, and Mendelson 2002). Recall the Bernstein inequality: Let ξ be a rando variable on Z with ean µ and variance σ 2. If ξ µ M, then Prob µ 1 } ξz i ) > ε 2 exp ε 2 2σ 2 + 1Mε) 3 The one-side Bernstein inequality holds without the leading factor 2. Proposition 5.1. Let ξ be a rando variable on Z satisfying µ 0, ξ µ M alost everywhere, and σ 2 cµ τ for soe 0 τ 2. Then for every ε > 0 there holds µ 1 Prob ξz } i) > ε 1 τ µ τ + ε τ ) exp ε 2 τ }. 2c Mε1 τ ) Proof. The one-side Bernstein inequality tells us that µ 1 Prob ξz } } i) > ε 1 τ ε 2 τ µ τ + ε τ ) µ τ + ε τ ) 1 2 exp 2 2 ). σ Mε1 τ 2 µ τ + ε τ ) }.

20 Since σ 2 cµ τ, we have σ 2 + M 3 ε1 τ 2 µ τ + ε τ ) 1 2 cµ τ + M 3 ε1 τ µ τ + ε τ ) µ τ + ε τ ) c Mε1 τ). This yields the desired inequality. Note that f z depends on z and thus runs over a set of functions as z changes. We need a probability inequality concerning the unifor convergence. Denote E g := Z gz)dρ. Lea 5.1. Let 0 τ 1, M > 0, c 0, and G be a set of functions on Z such that for every g G, E g 0, g E g M and E g 2 ce g) τ. Then for ε > 0, Prob E g 1 sup gz i) g G E g) τ + ε τ ) 1 2 > 4ε 1 τ 2 } N G, ε ) exp ε 2 τ 2c Mε1 τ ) Proof. Let g j } N j=1 G with N = N G, ε ) such that for every g G there is soe j 1,..., N } satisfying g g j ε. Then by Proposition 5.1, a standard procedure Cucker and Sale 2001; Mukherjee et al 2002; Chen et al. 2004) leads to the conclusion. Reark 5.1. Various fors of probability inequalities using epirical covering nubers can be found in the literature. For siplicity we give the current for in Lea 5.1 which is enough for our purpose. Let us find the hypothesis space covering f z when z runs over all possible saples. This is ipleented in the following two leas. By the idea of bounding the offset fro Wu and Zhou 2003), and Chen et al. 2004), we can prove the following. Lea 5.2. For any C > 0, N and z Z, we can find a solution f z of 1.7) satisfying in 1 i f zx i ) 1. Hence b z 1 + f z. We shall always choose f z as in Lea 5.2. In fact, the only restriction we need to ake for the iniizer f z is to choose α i = 0 and b z = y, i.e., f z x) = y whenever y i = y for all 1 i with soe y Y. 20 }.

21 Lea 5.3. For every C > 0, we have f z H K and f z K κωf z ) κc. Proof. It is trivial that f z H K. By the reproducing property 1.5), 1/2 ) 1/2 fz K = α i,z α j,z y i y j Kx i, x j )) κ α i,z α j,z = κωfz ). i,j=1 i,j=1 Bounding the solution to 1.7) by the choice f = 0 + 0, we have Ef z ) + 1 Ωf C z ) E0) + 0 = 1. This gives Ωfz ) C, and copletes the proof. By Lea 5.3 and Lea 5.2 we know that πf z ) lies in F R := πf) : f B R + [ 1 + κr), 1 + κr ]} 5.1) with R = κc. The following lea Chen et al. 2004) gives the covering nuber estiate for F R. Lea 5.4. Let F R be given by 5.1) with R > 0. For any ε > 0 there holds ) 21 + κr) ε ) N F R, ε) + 1 N. ε 2R Using the function set F R defined by 5.1), we set for R > 0, G R = V y, fx) ) } V y, f c x)) : f F R. 5.2) By Lea 5.4 and the additive property of the log function, we have Lea 5.5. Let G R given by 5.2) with R > 0. i) If H2) holds, then there exists a constant c s > 0 such that log N G R, ε) c s log R ε ) s. ii) If H2 ) holds, then there exists a constant c s > 0 such that R log N G R, ε) c s ε ) s. 21

22 The following lea was proved by Scovel and Steinwart 2003) for general functions f : X R. With the projection, here f has range [ 1, 1] and a sipler proof is given. Lea 5.6. Assue H3). For every function f : X [ 1, 1] there holds ) } 2 E V y, fx)) V y, f c x)) 8 1 ) ) q q/q+1) q+1 Ef) Ef c ). 2c q Proof. Since fx) [ 1, 1], we have V y, fx)) V y, f c x)) = yf c x) fx)). It follows that Ef) Ef c ) = f c x) fx))f ρ x)dρ X = f c x) fx) f ρ x) dρ X X and ) } 2 E V y, fx)) V y, f c x)) = f c x) fx) 2 dρ X. X Let t > 0 and separate the doain X into two sets: X t + := x X : f ρ x) > c q t} and Xt := x X : f ρ x) c q t}. On X t + we have f c x) fx) 2 2 f c x) fx) fρx) c qt. On Xt we have f c x) fx) 2 4. It follows fro Assuption H3) that f c x) fx) 2 dρ X 2 Ef) Ef c ) ) +4ρ X Xt ) 2 Ef) Ef c ) ) +4t q. c q t c q t X X Choosing t = Ef) Ef c ))/2c q ) } 1/q+1) yields the desired bound. Take the function set G in Lea 5.1 to be G R. Then a function g in G R takes the for gx, y) = V y, πf)x) ) V y, f c x)) with πf) F R. Obviously we have g 2, E g = Eπf)) Ef c ) and 1 gz i) = E z πf)) E z f c ). When Assuption H3) is valid, Lea 5.6 tells us that E g 2 c E g ) τ with τ = q and c = 8 ) 1 q/q+1). q+1 2c q Applying Lea 5.1 and solving the equation log N G R, ε ) ε 2 τ 2c ε1 τ ) = log δ, we see the following corollary fro Lea 5.5 and Lea

23 Corollary 5.1. Let G R be defined by 5.2) with R > 0 and H3) hold with 0 q. For every 0 < δ < 1, with confidence at least 1 δ, there holds } } Ef) Ef c ) E z f) E z f c ) 4ε,R + 4ε for all f F R, where ε,r is given by 5 8 ) ) 1 q/q+1) log 1 + 2c q + 1 δ c slog R + log ) s ) ) 1 q/q+1) 2c q + 1 c s R s ) q+1) q+2+qs+s log 1 δ Rate Analysis q+2 2q+1),R ) q+1 q+2 ) q+1 q+2 } q 2q+1) Ef) Ef c ), if H2) holds, ), if H2 ) holds. Let us now prove the ain results stated in Section 2. We first prove Proposition 2.1. Proof of Proposition 2.1. Since Rf c ) = 0, V y, f c x)) = 0 alost everywhere and Ef c ) = 0. Take η = 1 in Proposition 4.1. We first consider the rando variable ξ = V y, f K,C x)). Since 0 ξ M and E ξ = Ef K,C ) DC), we have σ 2 ξ) E ξ 2 M E ξ MDC). Applying the one-side Bernstein inequality to ξ, we see by solving the quadratic equation ε2 = log δ/2 ) that with probability 1 δ/2, 2σ 2 +Mε/3) E z f K,C ) Ef K,C ) 2M log ) 2 δ + 3 2σ 2 ξ) log 2/δ ) 5M log ) 2 δ + DC) ) Next we estiate Eπf z )) E z πf z )). By the definition of f z, there holds 1 C Ωf z ) E z f z ) + 1 C Ωf z ) E z f z ) + 1 C Ω f z ). 23

24 According to Lea 3.1, this is bounded by 2 E z f z ) + 1 connection with the definition of f z yields 2C f z 2 K 1 C Ωf z ) 2 E z f z ) + 1 2C f ) z 2 K 2 E z f K,C ) + 1 ) 2C f K,C 2 K. Since Ef c ) = 0, DC) = Ef K,C ) + 1 2C f K,C 2 K. It follows that 1 ) C Ωf z ) 2 E z f K,C ) Ef K,C ) + DC). ). This in Together with Lea 5.3 and 6.1), this tells us that with probability 1 δ/2 ) 5M log 2/δ ) fz K κωfz ) R := 2κC + 2DC). 3 As we are considering a deterinistic case, H3) holds with q = and c = 1. Recall the definition of G R in 5.2). Corollary 5.1 with q = and R given as above iplies that Eπf z )) E z πf z )) 4ε,C + 4 ε,c Eπfz )) with confidence 1 δ where ε,c is defined in the stateent. Putting the above two estiates into Proposition 4.1, we have with confidence 1 δ, Eπf z )) 4ε,C + 4 ε,c Eπfz )) + 10M log 2/δ ) + 4DC). 3 Solving the quadratic inequality for Eπf z )) leads to Eπf z )) 32ε,C + 20M log 2/δ ) + 8DC). 3 Then our conclusion follows fro 4.1). Finally, we turn to the proof of Theores 2 and 3. To this end, we need a bound for f K,C K. According to the definition, 1 2C f K,C 2 K DC). Then we have 24

25 Lea 6.1. For every C > 0, there hold f K,C K DC)) 1/2 2C and f DC)) 1/2. K,C 1 + 2κ 2C Proof of Theore 2. Take f K,C = f K,C in Proposition 4.1. Then by DC)) 1/2. Lea 6.1 we ay take M = 2 + 2κ 2C Proposition 2.1 with Assuption H2 ) yields Rf z ) c s,β,δ C 1 β)s/s+1) 1 1+s + C1 β)s/s+1) 1 1+s Take C = in 1 s+β, 2 1+β }. Then C 1+β)/2 C 1+β)/2 ) s 1+s + C 1 β } 2 + C β. 1 and the proof is coplete. Proof of Theore 3. Denote z = E πf z ) ) Ef c ) + 1 C Ωf z ). Then we have Ωf z ) C z. This in connection with Lea 5.3 yields f z K κωf z ) κc z. 6.2) Take f K,C = f K, C with C = ηc in Proposition 4.1. It tells us that z 2ηRf c ) + S, C, η) + 2 DηC). Set η = C β/β+1). Then C = ηc = C 1/β+1). By the fact Rf c ) 1 2 and Assuption H1), z S, C, η) c β )C β β ) Recall the expression 2.11) for S, C, η). Here f K,C = f K, C. So we have S, C, η) = E πf z ) ) ) Ef c ) E z πfz ) ) )} E z f c ) ) +1 + η) E z fk, C) Ez f c ) E )} fk, C) Efc ) } +η E z f c ) Ef c ) =: S η)s 2 + ηs 3. Take t 1, C 1 to be deterined later. For R 1, denote WR) := z Z : f z K R}. 6.4) 25

26 For S 1, we apply Corollary 5.1 with δ = e t 1/e. We know that there Z of easure at ost δ = e t such that is a set V 1) R R s ) q+1 q+2+qs+s R s S 1 c s,q t + ) q+1 q+2+qs+s q+2 2q+1) q 2q+1) z }, z WR)\V 1) R. Here c s,q := 32 8 ) 1 q/q+1) 2c q + 1 3) c s + 1) 1 is a constant depending only on q and s. To estiate S 2, consider ξ = V y, f Cx) ) K, V y, f c x)) on Z, ρ). By Lea 6.1, we have f K, C 1 + 2κ 2 C D C) 1 + 2κ 2c β C 1 β 2β+1). Write ξ = ξ 1 + ξ 2 where ξ 1 := V y, f K, Cx) ) V y, π f K, C)x) ), ξ 2 := V y, π f K, C)x) ) V y, f c x)). It is easy to check that 0 ξ 1 2κ 2c β C 1 β 2β+1). Hence σ 2 ξ 1 ) is bounded by 2κ 2c β C 1 β 2β+1) E ξ1. Then the one-side Bernstein inequality with δ = e t tells us that there is a set V 2) Z of easure at ost δ = e t such that for every z Z \ V 2), there holds 1 ξ 1 z i ) E ξ 1 4κ 2c β C 1 β 3 2β+1) t + 2σ2 ξ 1 )t 10κ 2cβ C 1 β 3 2β+1) t +E ξ 1. For ξ 2, by Lea 5.6, σ 2 ξ 2 ) 8 1 2c q ) q/q+1) E ξ 2 ) q q+1. But ξ 2 2. So the one-side Bernstein inequality tells us again that there is a set V 3) Z of easure at ost δ = e t such that for every z Z \V 3), there holds 1 ξ 1 z i ) E ξ 1 4t 3 + 4σ2 ξ 1 )t 26 4t ) q q+2 t q+2 + E ξ 2. 2c q )q+1

27 Here we have used the following eleentary inequality with b := E ξ 2 ) q 2q+2 and a := c q ) q/q+1) t/ ) 1/2 : a b q + 2 2q + 2 a2q+2)/q+2) + q 2q + 2 b2q+2)/q, a, b > 0. Cobing the two estiates for ξ 1, ξ 2 with the fact that E ξ = E ξ 1 +E ξ 2 = E fk, C) Efc ) D C) c β C β/β+1) we see that S 2 c q,β t C 1 β 2β+1) + 1 ) ) q+1 q+2 + C β β+1, z Z \ V 2) R \ V 3) R, where c q,β := 10κ 2c β / c q ) q/q+1) + c β is a constant depending on q, β. The last ter is S 3 1. Putting the above three estiates for S 1, S 2, S 3 to 6.3), we find that for every z WR) \ V 1) R \ V 2) \ V 3) there holds R s 1 z 2c s,q t ) q+1) q+2+qs+s + 8c q,β t ) q+1 q+2 + C β β+1 C 1/2 )} ) Here we have used another eleentary inequality for α = q/2q + 2) 0, 1) and x = z : x ax α + b, a, b, x > 0 = x ax2a) 1/1 α), 2b}. Now we can choose C to be } C := in 2, q+1)β+1) sq+1)+βq+2+qs+s). 6.6) It ensures that ) q+1 1 q+2 C β β+1 and 1 ) q+1) q+2+qs+s C sq+1)+βq+2+qs+s) β+1)q+2+qs+s). With this choice of C, 6.5) iplies that with a set V R := V 1) R V 2) R V 3) R easure at ost 3e t, ) z C β β+1 2c s,q t C 1 sq+1) } q+2+qs+s β+1 R + 24c q,β t, z WR) \ V R. 6.7) We shall finish our proof by using 6.2) and 6.7) iteratively. 27 of

28 Start with the bound R = R 0) := κc. Lea 5.3 verifies WR 0) ) = Z. At this first step, by 6.7) and 6.2) we have Z = WR 0) ) WR 1) ) V R 0), where R 1) := κc 1 β+1 2c s,q tκ + 1)) C β β+1 } sq+1) q+2+qs+s + 24cq,β t. Now we iterate. For n = 2, 3,, we derive fro 6.7) and 6.2) that Z = WR 0) ) WR 1) ) V R 0) WR n) ) n 1 j=0 V R j) ), where each set V R j) has easure at ost 3e t and the nuber R n) is given by } R n) = κc 1 β+1 2c s,q tκ + 1)) n C β sq+1) β+1 q+2+qs+s) n + 24c q,β tκ + 1)n. Note that ɛ > 0 is fixed. We choose n 0 N to be large enough such that ) n0 +1) sq + 1) ɛ s + 2s q qs + s β + q + 2 ). q + 1 In the n 0 -th step of our iteration we have shown that for z WR n0) ), fz K κc 1 β+1 2c s,q tκ + 1)) n 0 C β sq+1) β+1 q+2+qs+s) n cq,β tκ + 1)n 0 }. This together with 6.5) gives z cs, q, β, ɛ)t n 0 ax 2β β+1, βq+1) +ɛ} sq+1)+βs+2+qs+s). This is true for z WR n0) ) \ V R n 0 ). Since the set n 0 j=0 V R j) has easure at ost 3n 0 +1)e t, we know that the set WR n0) )\V R n 0 ) has easure at least 1 3n 0 + 1)e t. Note that E πf z ) ) Ef c ) z. Take t = log 3n 0+1) ). δ Then the proof is finished by 4.1). Acknowledgents This work is partially supported by the Research Grants Council of Hong Kong [Project No. CityU ] and by City University of Hong Kong [Project No ]. The corresponding author is Ding-Xuan Zhou. 28

29 References Anthony, M., and Bartlett, P. L. 1999). Neural Network Learning: Theoretical Foundations. Cabridge University Press. Aronszajn, N. 1950). Theory of reproducing kernels. Trans. Aer. Math. Soc., 68, Barron, A. R. 1990). Coplexity regularization with applications to artificial neural networks. In Nonparaetric Functional Estiation G. Roussa, ed.), Dortrecht: Kluwer. Bartlett, P. L. 1998). The saple coplexity of pattern classification with neural networks: the size of the weights is ore iportant than the size of the network. IEEE Trans. Infor. Theory, 44, Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. 2003). Convexity, classification, and risk bounds. Preprint. Blanchard, B., Bousquet, O., and Massart, P. 2004). Statistical perforance of support vector achines. Preprint. Boser, B. E., Guyon, I., and Vapnik, V. 1992). A training algorith for optial argin classifiers. In Proceedings of the Fifth Annual Workshop of Coputational Learning Theory, Vol. 5, Pittsburgh: ACM. Bousquet, O., and Elisseeff, A. 2002). Stability and generalization. J. Machine Learning Research, 2, Bradley, P. S., and Mangasarian, O. L. 2000). Massive data discriination via linear support vector achines. Optiization Methods and Software, 13, Chen, D. R., Wu, Q., Ying, Y., and Zhou, D. X. 2004). Support vector achine soft argin classifiers: error analysis. J. Machine Learning Research, 5, Cortes, C., and Vapnik, V. 1995). Support-vector networks. Mach. Learning, 20, Cristianini, N., and Shawe-Taylor, J. 2000). An Introduction to Support Vector Machines. Cabridge University Press. 29

30 Cucker, F., and Sale, S. 2001). On the atheatical foundations of learning. Bull. Aer. Math. Soc., 39, Devroye, L., Györfi, L., and Lugosi, G. 1997). A probabilistic Theory of Pattern Recognition. New York: Springer-Verlag. Evgeniou, T., Pontil, M., and Poggio, T. 2000). Regularization networks and support vector achines. Adv. Coput. Math., 13, Kecan, V., and Hadzic, I. 2000). Support vector selection by linear prograing. Proc. of IJCNN, 5, Lugosi, G., and Vayatis, N. 2004). On the Bayes-risk consistency of regularized bossting ethods. Ann. Statis., 32, Mendelson, S. 2002). Iproving the saple coplexity using global data. IEEE Trans. Infor. Theory, 48, Mukherjee, S., Rifkin, R., and Poggio, T. 2002). Regression and classification with regularization. In Lecture Notes in Statistics: Nonlinear Estiation and Classification, D. D. Denison, M. H. Hansen, C. C. Holes, B. Mallick, and B. Yu eds.), New York: Springer-Verlag. Niyogi, P. 1998). The Inforational Coplexity of Learning. Kluwer. Niyogi, P., and Girosi, F. 1996). On the relationship between generalization error, hypothesis coplexity, and saple coplexity for radial basis functions. Neural Cop., 8, Pedroso, J. P., and Murata, N. 2001). Support vector achines with different nors: otivation, forulations and results. Pattern recognition Letters, 22, Pontil, M. 2003). A note on different covering nubers in learning theory. J. Coplexity, 19, Rosasco, L., De Vito, E., Caponnetto, A., Piana, M., and Verri, A. 2004). Are loss functions all the sae? Neural Cop., 16, Scovel, C., and Steinwart, I. 2003). Fast rates for support vector achines. Preprint. Sale, S., and Zhou, D. X. 2003). Estiating the approxiation error in learning theory. Anal. Appl., 1,

31 Sale, S., and Zhou, D. X. 2004). Shannon sapling and function reconstruction fro point values. Bull. Aer. Math. Soc., 41, Steinwart, I. 2002). Support vector achines are universally consistent. J. Coplexity, 18, Tsybakov, A. B. 2004). Optial aggregation of classifiers in statistical learning. Ann. Statis., 32, van der Vaart, A. W., and Wellner, J. A. 1996). Weak Convergence and Epirical Processes. New York: Springer-Verlag. Vapnik, V. 1998). Statistical Learning Theory. John Wiley & Sons. Wahba, G. 1990). Spline Models for Observational Data. SIAM. Wu, Q., Ying, Y., and Zhou, D. X. 2004). Multi-kernel regularized classifiers. Preprint. Wu, Q., and Zhou, D. X. 2004). Analysis of support vector achine classification. Preprint. Zhang, T. 2004). Statistical behavior and consistency of classification ethods based on convex risk iniization. Ann. Statis., 32, Zhang, T. 2002). Covering nuber bounds of certain regularized linear function classes. J. Machine Learning Research, 2, Zhou, D. X. 2002). The covering nuber in learning theory. J. Coplexity, 18, Zhou, D. X. 2003). Capacity of reproducing kernel spaces in learning theory. IEEE Trans. Infor. Theory, 49,

Learnability of Gaussians with flexible variances

Learnability of Gaussians with flexible variances Learnability of Gaussians with flexible variances Ding-Xuan Zhou City University of Hong Kong E-ail: azhou@cityu.edu.hk Supported in part by Research Grants Council of Hong Kong Start October 20, 2007

More information

Shannon Sampling II. Connections to Learning Theory

Shannon Sampling II. Connections to Learning Theory Shannon Sapling II Connections to Learning heory Steve Sale oyota echnological Institute at Chicago 147 East 60th Street, Chicago, IL 60637, USA E-ail: sale@athberkeleyedu Ding-Xuan Zhou Departent of Matheatics,

More information

Multi-kernel Regularized Classifiers

Multi-kernel Regularized Classifiers Multi-kernel Regularized Classifiers Qiang Wu, Yiing Ying, and Ding-Xuan Zhou Departent of Matheatics, City University of Hong Kong Kowloon, Hong Kong, CHINA wu.qiang@student.cityu.edu.hk, yying@cityu.edu.hk,

More information

E0 370 Statistical Learning Theory Lecture 6 (Aug 30, 2011) Margin Analysis

E0 370 Statistical Learning Theory Lecture 6 (Aug 30, 2011) Margin Analysis E0 370 tatistical Learning Theory Lecture 6 (Aug 30, 20) Margin Analysis Lecturer: hivani Agarwal cribe: Narasihan R Introduction In the last few lectures we have seen how to obtain high confidence bounds

More information

Yongquan Zhang a, Feilong Cao b & Zongben Xu a a Institute for Information and System Sciences, Xi'an Jiaotong. Available online: 11 Mar 2011

Yongquan Zhang a, Feilong Cao b & Zongben Xu a a Institute for Information and System Sciences, Xi'an Jiaotong. Available online: 11 Mar 2011 This article was downloaded by: [Xi'an Jiaotong University] On: 15 Noveber 2011, At: 18:34 Publisher: Taylor & Francis Infora Ltd Registered in England and Wales Registered Nuber: 1072954 Registered office:

More information

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization Recent Researches in Coputer Science Support Vector Machine Classification of Uncertain and Ibalanced data using Robust Optiization RAGHAV PAT, THEODORE B. TRAFALIS, KASH BARKER School of Industrial Engineering

More information

1 Bounding the Margin

1 Bounding the Margin COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #12 Scribe: Jian Min Si March 14, 2013 1 Bounding the Margin We are continuing the proof of a bound on the generalization error of AdaBoost

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory Proble sets 5 and 6 Due: Noveber th Please send your solutions to learning-subissions@ttic.edu Notations/Definitions Recall the definition of saple based Radeacher

More information

Support Vector Machines. Goals for the lecture

Support Vector Machines. Goals for the lecture Support Vector Machines Mark Craven and David Page Coputer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Soe of the slides in these lectures have been adapted/borrowed fro aterials developed

More information

A Simple Regression Problem

A Simple Regression Problem A Siple Regression Proble R. M. Castro March 23, 2 In this brief note a siple regression proble will be introduced, illustrating clearly the bias-variance tradeoff. Let Y i f(x i ) + W i, i,..., n, where

More information

An l 1 Regularized Method for Numerical Differentiation Using Empirical Eigenfunctions

An l 1 Regularized Method for Numerical Differentiation Using Empirical Eigenfunctions Journal of Matheatical Research with Applications Jul., 207, Vol. 37, No. 4, pp. 496 504 DOI:0.3770/j.issn:2095-265.207.04.0 Http://jre.dlut.edu.cn An l Regularized Method for Nuerical Differentiation

More information

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab Support Vector Machines Machine Learning Series Jerry Jeychandra Bloh Lab Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a

More information

PAC-Bayes Analysis Of Maximum Entropy Learning

PAC-Bayes Analysis Of Maximum Entropy Learning PAC-Bayes Analysis Of Maxiu Entropy Learning John Shawe-Taylor and David R. Hardoon Centre for Coputational Statistics and Machine Learning Departent of Coputer Science University College London, UK, WC1E

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 018: Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee7c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee7c@berkeley.edu October 15,

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley ENSIAG 2 / osig 1 Second Seester 2012/2013 Lesson 20 2 ay 2013 Kernel ethods and Support Vector achines Contents Kernel Functions...2 Quadratic

More information

Combining Classifiers

Combining Classifiers Cobining Classifiers Generic ethods of generating and cobining ultiple classifiers Bagging Boosting References: Duda, Hart & Stork, pg 475-480. Hastie, Tibsharini, Friedan, pg 246-256 and Chapter 10. http://www.boosting.org/

More information

3.8 Three Types of Convergence

3.8 Three Types of Convergence 3.8 Three Types of Convergence 3.8 Three Types of Convergence 93 Suppose that we are given a sequence functions {f k } k N on a set X and another function f on X. What does it ean for f k to converge to

More information

Support Vector Machines. Maximizing the Margin

Support Vector Machines. Maximizing the Margin Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the

More information

Robustness and Regularization of Support Vector Machines

Robustness and Regularization of Support Vector Machines Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA

More information

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley osig 1 Winter Seester 2018 Lesson 6 27 February 2018 Outline Perceptrons and Support Vector achines Notation...2 Linear odels...3 Lines, Planes

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE227C (Spring 2018): Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee227c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee227c@berkeley.edu October

More information

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011)

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011) E0 370 Statistical Learning Theory Lecture 5 Aug 5, 0 Covering Nubers, Pseudo-Diension, and Fat-Shattering Diension Lecturer: Shivani Agarwal Scribe: Shivani Agarwal Introduction So far we have seen how

More information

1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

Are Loss Functions All the Same?

Are Loss Functions All the Same? Are Loss Functions All the Same? L. Rosasco E. De Vito A. Caponnetto M. Piana A. Verri November 11, 2003 Abstract In this paper we investigate the impact of choosing different loss functions from the viewpoint

More information

Sharp Time Data Tradeoffs for Linear Inverse Problems

Sharp Time Data Tradeoffs for Linear Inverse Problems Sharp Tie Data Tradeoffs for Linear Inverse Probles Saet Oyak Benjain Recht Mahdi Soltanolkotabi January 016 Abstract In this paper we characterize sharp tie-data tradeoffs for optiization probles used

More information

Derivative reproducing properties for kernel methods in learning theory

Derivative reproducing properties for kernel methods in learning theory Journal of Computational and Applied Mathematics 220 (2008) 456 463 www.elsevier.com/locate/cam Derivative reproducing properties for kernel methods in learning theory Ding-Xuan Zhou Department of Mathematics,

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material Consistent Multiclass Algoriths for Coplex Perforance Measures Suppleentary Material Notations. Let λ be the base easure over n given by the unifor rando variable (say U over n. Hence, for all easurable

More information

Manifold learning via Multi-Penalty Regularization

Manifold learning via Multi-Penalty Regularization Manifold learning via Multi-Penalty Regularization Abhishake Rastogi Departent of Matheatics Indian Institute of Technology Delhi New Delhi 006, India abhishekrastogi202@gail.co Abstract Manifold regularization

More information

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS
