Multi-kernel Regularized Classifiers
Qiang Wu, Yiming Ying, and Ding-Xuan Zhou
Department of Mathematics, City University of Hong Kong, Kowloon, Hong Kong, CHINA

Abstract

A family of classification algorithms generated from Tikhonov regularization schemes is considered. They involve multi-kernel spaces and general convex loss functions. Our main purpose is to provide satisfactory estimates for the excess misclassification error of these multi-kernel regularized classifiers. The error analysis consists of two parts: regularization error and sample error. Allowing multi-kernels in the algorithm improves the regularization error and approximation error, which is one advantage of the multi-kernel setting. For a general loss function, we show how to bound the regularization error by the approximation in some weighted L^q spaces. For the sample error, we use a projection operator. The projection, in connection with the decay of the regularization error, enables us to improve convergence rates in the literature even for the one-kernel schemes and special loss functions: the least square loss and the hinge loss for support vector machine soft margin classifiers. Existence of a solution to the optimization problem for the regularization scheme associated with multi-kernels is verified when the kernel functions are continuous with respect to the index set. Gaussian kernels with flexible variances and probability distributions with some noise conditions are used to illustrate the general theory.

Keywords and Phrases: Classification algorithm, multi-kernel regularization scheme, convex loss function, misclassification error, regularization error and sample error.

Supported by the Research Grants Council of Hong Kong [Project No. CityU ]. Corresponding author: Ding-Xuan Zhou.
1. Introduction

We study binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and multi-kernel spaces. These algorithms produce binary classifiers f : X → {1, −1} from a compact metric space X (called the input space) to the output space Y = {1, −1} (representing the two classes). Such a classifier f assigns a class label f(x) ∈ Y to each point x (when X ⊂ R^n, x is a vector representing an event with each component corresponding to a specific measurement). The classifiers considered here have the form sgn(f), defined as sgn(f)(x) = 1 if f(x) ≥ 0 and sgn(f)(x) = −1 if f(x) < 0, induced by real-valued functions. These functions are solutions of optimization problems associated with a sample z = {(x_i, y_i)}_{i=1}^m, drawn independently according to an (unknown) probability distribution ρ on Z = X × Y. The nature of such an optimization problem (called a Tikhonov regularization scheme) is determined by two objects: a loss function and a hypothesis space.

Definition 1. A function φ : R → R_+ is called an activating loss (function) for classification if it is convex, φ'_−(0) < 0, and inf_{t∈R} φ(t) = 0.

Typical examples of activating losses include the hinge loss φ_h(t) = (1−t)_+ = max{1−t, 0} for SVM classification and the exponential loss φ_exp(t) = e^{−t} for boosting.

Let φ be an activating loss. For a real-valued function f, when sgn(f) is used for classification or prediction, the local error incurred for the event x and output y is measured by the value φ(yf(x)). The average of the local errors,

E^φ(f) = ∫_Z φ(yf(x)) dρ,

is called the error or generalization error.

The convexity and the condition φ'_−(0) < 0 imply that φ(yf(x)) ≥ φ(0) > 0 when yf(x) < 0, i.e., when sgn(f)(x) predicts the class label y incorrectly. So local errors can be small only if yf(x) > 0. Hence minimizing the generalization error is expected to lead to a function predicting the labels satisfactorily.
This gives the intuition that φ is admissible for classification problems, as verified by many examples in practice. Since the generalization error involving the unknown distribution ρ is not computable,
its discretization, computable in terms of the sample z, is used instead. It is defined as

E_z^φ(f) = (1/m) Σ_{i=1}^m φ(y_i f(x_i))

and called the empirical error. Regularized learning schemes are implemented by minimizing a penalized version of the empirical error over a set of functions, called a hypothesis space H, equipped with a functional Ω : H → R_+. The penalty functional Ω reflects constraints imposed on functions from the hypothesis space in various desirable forms.

Definition 2. Given a function φ : R → R_+ and a hypothesis space H together with a penalty functional Ω, the regularized classifier generated from a sample z ∈ Z^m is defined as sgn(f_z), where f_z is a minimizer of the Tikhonov regularization scheme

f_z := arg min_{f∈H} { (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + λ Ω(f) }.   (1.1)

Here λ is a positive constant called the regularization parameter. It depends on m: λ = λ(m), and usually λ(m) → 0 as m becomes large.

Reproducing kernel Hilbert spaces are often used as the hypothesis space in (1.1). They play an important role in learning theory because of their reproducing property. Let K : X × X → R be continuous, symmetric and positive semidefinite, i.e., for any finite set of distinct points {x_1, ..., x_l} ⊂ X, the matrix (K(x_i, x_j))_{i,j=1}^l is positive semidefinite. Such a function is called a Mercer kernel.

The Reproducing Kernel Hilbert Space (RKHS) H_K associated with the Mercer kernel K is defined (see [1]) to be the completion of the linear span of the set of functions {K_x = K(x, ·) : x ∈ X} with the inner product ⟨·, ·⟩_K given by ⟨K_x, K_y⟩_K = K(x, y). The reproducing property of H_K is

⟨K_x, f⟩_K = f(x),  ∀x ∈ X, f ∈ H_K.   (1.2)

The classical soft margin classifiers [35] correspond to the scheme (1.1) with H = H_K:

f_z = arg min_{f∈H_K} { (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + λ ||f||_K^2 }.   (1.3)

In this paper we introduce a multi-kernel setting where H is the union of a set of reproducing kernel Hilbert spaces (and Ω(f) is the squared infimum norm of f).
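As a concrete illustration of the one-kernel scheme (1.3), the following sketch (not from the paper; the function names, the choice of the least square loss φ(t) = (1−t)^2 and the Gaussian kernel are assumptions made for the example) solves the finite-dimensional problem that (1.3) reduces to via the representer theorem: writing f = Σ_j c_j K(x_j, ·), the objective (1/m)||Kc − y||^2 + λ c^T K c is minimized by solving (K + λ m I) c = y.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """Mercer kernel K(x, y) = exp(-|x - y|^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def regularized_classifier(X, y, lam, sigma):
    """Scheme (1.3) with the least square loss phi(t) = (1 - t)^2.

    By the representer theorem f_z = sum_j c_j K(x_j, .), and minimizing
    (1/m) sum_i (1 - y_i f(x_i))^2 + lam ||f||_K^2 over the coefficients c
    reduces to the linear system (K + lam * m * I) c = y.
    """
    m = len(y)
    K = gaussian_kernel(X, X, sigma)
    c = np.linalg.solve(K + lam * m * np.eye(m), y)

    def predict(X_new):
        f = gaussian_kernel(X_new, X, sigma) @ c
        return np.where(f >= 0, 1, -1)  # sgn(f) as defined above

    return predict
```

The closed-form solve is specific to the least square loss; for the hinge loss the inner problem is the usual SVM quadratic program instead.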
Definition 3. Let K_Σ = {K_σ : σ ∈ Σ} be a set of Mercer kernels on X. The multi-kernel space associated with K_Σ is defined to be the union H_Σ = ∪_{σ∈Σ} H_{K_σ}. For f ∈ H_Σ, we take

||f||_Σ = inf { ||f||_{K_σ} : f ∈ H_{K_σ}, σ ∈ Σ }.   (1.4)

Taking H_Σ as the hypothesis space and Ω(f) = ||f||_Σ^2 in (1.1) leads to the following scheme in the multi-kernel space H_Σ:

f_z = arg min_{f∈H_Σ} { (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + λ ||f||_Σ^2 }.   (1.5)

The corresponding multi-kernel regularized classifier is given by sgn(f_z).

Denote (H_{K_σ}, ||·||_{K_σ}) as (H_σ, ||·||_σ) for simplicity. The regularization scheme in the multi-kernel space H_Σ can be rewritten as a two-layer minimization problem:

f_z = arg min_{σ∈Σ} min_{f∈H_σ} { (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + λ ||f||_σ^2 }.   (1.6)

It reduces to (1.3) when Σ contains only one element.

Our study of general multi-kernel schemes is motivated by recent work on learning algorithms with varying kernels. In [8] support vector machines with multiple parameters are investigated. In [19, 24] mixture density estimation is considered, and Gaussian kernels with variance σ^2 flexible on an interval [σ_1^2, σ_2^2] with 0 < σ_1 < σ_2 < +∞ are used for deriving bounds. Approximation properties of multi-kernel spaces are studied in [44]. Multi-task learning algorithms involve kernels from a convex hull of several Mercer kernels and spaces with changing norms, e.g. [16, 18].

The first natural concern about the optimization problem (1.5) or (1.6) is the existence of a minimizer. This is assured by the compactness of the metric index set Σ and the continuity of K_σ with respect to σ ∈ Σ, as stated in the next result, which follows from Proposition 1 given in Section 2.

Theorem 1. Let φ be an activating loss. If the index set Σ is a compact metric space and, for each pair (x, y), the function K_σ(x, y) is continuous with respect to σ ∈ Σ, then a solution f_z to the multi-kernel scheme (1.6) exists.
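The two-layer structure of (1.6) suggests a direct numerical sketch: minimize over f ∈ H_σ for each σ, then minimize the resulting objective over σ. The code below (an illustration, not the paper's algorithm; it assumes the least square loss, a Gaussian kernel family, and a finite grid standing in for the compact index set Σ) implements exactly this outer/inner split.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """Mercer kernel K_sigma(x, y) = exp(-|x - y|^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def multi_kernel_fit(X, y, lam, sigmas):
    """Two-layer minimization (1.6) over a finite grid approximating Sigma.

    With the least square loss the inner problem has the closed form
    c_sigma = (K_sigma + lam*m*I)^{-1} y; the outer layer keeps the sigma
    whose penalized empirical objective is smallest.
    """
    m = len(y)
    best = None
    for sigma in sigmas:
        K = gaussian_kernel(X, X, sigma)
        c = np.linalg.solve(K + lam * m * np.eye(m), y)
        fx = K @ c
        # empirical error + lam * ||f||_sigma^2, with ||f||_sigma^2 = c^T K c
        obj = np.mean((1.0 - y * fx) ** 2) + lam * (c @ K @ c)
        if best is None or obj < best[0]:
            best = (obj, sigma, c)
    return best  # (objective value, sigma_hat, coefficients)
```

Theorem 1 guarantees that the exact outer minimum over a compact Σ exists; the grid here only approximates it.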
In particular, f_z exists in the one-kernel setting (1.3). We shall assume the existence of a solution to the optimization problem (1.6) throughout the error analysis of multi-kernel regularized classifiers, the main goal of this paper.

Let (X, Y) be the random variable on X × Y with the probability distribution ρ. The misclassification error for a classifier f : X → Y is defined to be the probability of the event {f(X) ≠ Y}:

R(f) = Prob{f(X) ≠ Y} = ∫_X P(Y ≠ f(x) | x) dρ_X.   (1.7)

Here ρ_X is the marginal distribution on X and P(·|x) is the conditional distribution. The target of our error analysis is to understand how sgn(f_z) approximates the Bayes rule, the best classifier with respect to the misclassification error: f_c = arg inf R(f), with the infimum taken over all classifiers. Denote η(x) = P(Y = 1|x) and recall the regression function

f_ρ(x) = ∫_Y y dρ(y|x) = P(Y = 1|x) − P(Y = −1|x) = 2η(x) − 1,  x ∈ X.   (1.8)

Then the Bayes rule is given (e.g. [15]) by the sign of the regression function: f_c = sgn(f_ρ). Estimating the excess misclassification error

R(sgn(f_z)) − R(f_c)   (1.9)

for the multi-kernel regularized classification algorithm (1.6) is our main purpose.

For the one-kernel setting (1.3) and special choices of φ, the error analysis has been extensively investigated in the literature, especially when ρ is strictly separable (with a positive margin). Besides the hinge loss φ_h corresponding to the SVM 1-norm soft margin classifier [35, 25, 28, 11, 37], examples of loss functions include: (1) φ_q(t) = (1−t)_+^q for the SVM q-norm (q > 1) soft margin classifier, see [35, 20, 9]; (2) the least square loss φ_ls(t) = (1−t)^2, see e.g. [12, 15, 17, 23, 29, 31, 40]; (3) the exponential loss φ_exp(t) = e^{−t}, see [40, 4]; (4) the logistic regression losses φ(t) = log(1 + e^{−t}) or 1/(1 + e^{−t}), see [40, 4].

For the error bounds, we will focus on activating loss functions achieving zero, which allows us to provide a powerful analysis.
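The loss functions listed above can be written down directly as functions of the margin t = yf(x); the sketch below (illustrative helper names, not part of the paper) collects them. Note that the hinge, q-norm and least square losses attain their infimum 0 (at t = 1), while the exponential and logistic losses only approach 0 as t → +∞.

```python
import numpy as np

# All losses act on the margin t = y * f(x); phi(t) >= phi(0) > 0 when t < 0.
def hinge(t):          return np.maximum(1.0 - t, 0.0)        # phi_h
def q_norm(t, q=2.0):  return np.maximum(1.0 - t, 0.0) ** q   # phi_q, q > 1
def least_square(t):   return (1.0 - t) ** 2                  # phi_ls
def exponential(t):    return np.exp(-t)                      # phi_exp
def logistic(t):       return np.log(1.0 + np.exp(-t))        # logistic regression

# hinge, q_norm and least_square achieve the value 0 (classifying losses);
# exponential and logistic are activating but never reach 0.
```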
Definition 4. An activating loss is called a classifying loss if the infimum 0 can be achieved. It is called normalized if 1 is its minimal zero.

Examples of classifying losses include the hinge loss φ_h, the q-norm loss φ_q for SVM classification and the least square loss φ_ls(t) = (1−t)^2. They are all normalized.

Our error analysis will be carried out in Sections 3-5. It uses an error decomposition procedure for regularization schemes introduced in [9, 38], with the aid of an iteration technique [30, 38] and a projection operator [9]. The convergence rates will be stated in terms of the sample size m with proper choices of the regularization parameter λ = λ(m) → 0.

Our analysis is powerful: it yields fast convergence rates. Let us demonstrate this for the SVM. Assume X ⊂ R^n and, for some s > n, the multi-kernels K_Σ satisfy

sup_{σ∈Σ} ||K_σ||_{C^s(X×X)} < ∞.   (1.10)

This means that {K_σ : σ ∈ Σ} is a set of C^s Mercer kernels with a uniform bound. The convergence rate for the SVM with such multi-kernels can be stated as follows.

Theorem 2. Let φ = φ_h and f_z be given by (1.6). Assume for some 0 < β ≤ 1 and c_β > 0,

inf_{σ∈Σ} inf_{f∈H_σ} { ||f − f_c||_{L^1_{ρ_X}} + λ ||f||_σ^2 } ≤ c_β λ^β,  ∀λ > 0.   (1.11)

If (1.10) holds for some s > n, choose λ(m) = (1/m)^{min{ 1/(2β+(1−β)n/s), 2/(1+β) }}. For any ε > 0 and 0 < δ < 1, there exists a constant c independent of m such that, with confidence 1 − δ,

R(sgn(f_z)) − R(f_c) ≤ c (1/m)^θ,   (1.12)

where θ = min{ β/(2β+(1−β)n/s) − ε, 2β/(1+β) }.

In Theorem 2, the condition (1.11) measures the approximation power of the multi-kernel space H_Σ in L^1_{ρ_X}, acting on the function f_c. It can be described by some interpolation spaces of the pair (H_Σ, L^1_{ρ_X}). So only the sign of f_ρ is involved in (1.11). If further information about the distribution ρ is available, one expects sharper error estimates. For example, when ρ satisfies a so-called Tsybakov noise condition

ρ_X({x ∈ X : 0 < |f_ρ(x)| ≤ t}) ≤ c_ζ t^ζ,  ∀t > 0,   (1.13)
with some ζ ∈ [0, ∞] and c_ζ > 0, then the power θ in the error bound (1.12) can be improved to θ = min{ β(ζ+1)/(β(ζ+2)+(ζ+1−β)n/s) − ε, 2β/(1+β) }. This will be shown in Theorem 6 below (in Section 5). Note that any distribution satisfies (1.13) with ζ = 0. The case ζ = ∞ amounts to requiring that |f_ρ(x)| is bounded away from zero or f_ρ(x) = 0, meaning that the two classes are well separated.

Our result is completely new for the multi-kernel setting. Even for the one-kernel setting H_Σ = H_K, Theorem 2 provides the best convergence rate for the SVM under the same assumption (1.11) on the approximation power of H_K and the regularity condition of the kernel (K ∈ C^s with s > n): the capacity-independent estimates derived by Zhang [40] yield the learning rate (1.12) with θ = β/(1+β); under the noise condition (1.13), Steinwart and Scovel [30] obtained the learning rate (1.12) with θ = 2β(ζ+1)/((2+ζ+ζn/s)(1+β)) − ε. Since s > n, our rate is sharper than theirs.

2. Optimization Problem for Regularization with Multi-kernels

We divide the study of the optimization problem (1.6) into two steps. First, fix σ ∈ Σ. Denote the optimal solution in the RKHS H_σ as

f_{z,σ} = arg inf_{f∈H_σ} { (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + λ ||f||_σ^2 }.

Define the dual function ψ : R → R of φ by

ψ(v) = sup_{u∈R} { vu − φ(u) },  v ∈ R.   (2.1)

By the reproducing property (1.2), the optimization problem for solving f_{z,σ} on H_σ can be reduced to one on R^m. The following relation between the primal problem and its dual is well known (see e.g. [41]):

inf_{f∈H_σ} { (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + λ ||f||_σ^2 } = sup_{α∈R^m} R̂(α, σ),

where

R̂(α, σ) := −(1/m) Σ_{i=1}^m ψ(−m α_i y_i) − (1/(4λ)) Σ_{i,j=1}^m α_i K_σ(x_i, x_j) α_j,  α ∈ R^m.
Moreover, both optimizers exist. If α̂_σ = arg max_{α∈R^m} R̂(α, σ), then sgn((α̂_σ)_i) = y_i and

f_{z,σ}(x) = (1/(2λ)) Σ_{i=1}^m (α̂_σ)_i K_σ(x_i, x).

Next, consider the multi-kernel scheme (1.6). A solution f_z can be represented as

f_z(x) = (1/(2λ)) Σ_{i=1}^m α̂_i K_{σ̂}(x_i, x)

if an optimal point (α̂, σ̂) of the following dual problem exists:

(α̂, σ̂) = arg min_{σ∈Σ} max_{α∈R^m} R̂(α, σ).   (2.2)

We show that under some mild conditions, (2.2) can be solved.

Proposition 1. Under the conditions of Theorem 1, an optimal point (α̂, σ̂) of (2.2) can be achieved. Hence an optimal solution f_z to the multi-kernel regularization scheme (1.5) always exists.

Proof. We first claim that there exists a constant C(φ, m) depending on φ and the sample size m such that

||α̂_σ||_{ℓ^∞(R^m)} ≤ C(φ, m),  ∀σ ∈ Σ.   (2.3)

To verify our claim, recall that α̂_σ is a maximizer of R̂(·, σ). This yields

R̂(α̂_σ, σ) ≥ R̂(0, σ) = −ψ(0) = −sup_{u∈R} {−φ(u)} = inf_{u∈R} φ(u) = 0.

Since K_σ is positive semidefinite, it follows that

−(1/m) Σ_{i=1}^m ψ(−m(α̂_σ)_i y_i) = R̂(α̂_σ, σ) + (1/(4λ)) Σ_{i,j=1}^m (α̂_σ)_i K_σ(x_i, x_j) (α̂_σ)_j ≥ R̂(α̂_σ, σ) ≥ 0.

However, for each v ∈ R,

ψ(−v) = sup_{u∈R} { −uv − φ(u) } ≥ −φ(0).

Therefore, for each i ∈ {1, ..., m}, we have

ψ(−m(α̂_σ)_i y_i) ≤ −Σ_{j≠i} ψ(−m(α̂_σ)_j y_j) ≤ (m−1)φ(0).   (2.4)
Now we prove our claim in two cases. Recall that the convexity of φ implies that the one-sided derivatives φ'_+ and φ'_− exist, are nondecreasing, and satisfy φ'_−(t) ≤ φ'_+(t) for any t ∈ R.

Case 1: φ'_+(t) ≤ 0 for each t ∈ R. In this case, φ is nonincreasing and lim_{u→+∞} φ(u) = inf_{u∈R} φ(u) = 0. This in connection with the definition of the dual function implies

ψ(−v) = sup_{u∈R} { −uv − φ(u) } ≥ lim_{u→+∞} (−uv) = +∞,  ∀v < 0.   (2.5)

It follows from (2.4) that m(α̂_σ)_i y_i ≥ 0 for each i. Definition 1 also tells us that φ is strictly decreasing on (−∞, 0] and lim_{t→−∞} φ(t) = +∞. Then the inverse function φ^{−1} is well defined on [φ(0), +∞). Choosing u = φ^{−1}(√v) for v ≥ (φ(0))^2 in the definition of ψ, we see that

ψ(−v) ≥ −v φ^{−1}(√v) − φ(φ^{−1}(√v)) = −v φ^{−1}(√v) − √v.

It follows that for any v ≥ max{1, (φ(−2))^2} there holds ψ(−v) ≥ 2v − √v ≥ v ≥ √v. Hence

v ≤ max{ 1, (φ(−2))^2, (ψ(−v))^2 },  ∀v ≥ 0.

Combining with (2.4), this implies that

m(α̂_σ)_i y_i ≤ max{ 1, (φ(−2))^2, (m−1)^2 (φ(0))^2 } =: C_1(φ, m).

As y_i = ±1 and m(α̂_σ)_i y_i ≥ 0, we know that m|(α̂_σ)_i| = m(α̂_σ)_i y_i ≤ C_1(φ, m) for each i. This proves our claim in Case 1: ||α̂_σ||_{ℓ^∞(R^m)} ≤ C_1(φ, m).

Case 2: φ'_+(t_0) > 0 for some t_0 ∈ R. In this case, t_0 > 0 and φ is strictly increasing on [t_0, +∞). Then for v ≥ max{1, (φ(t_0+2))^2}, there exists some u_v ≥ t_0 + 2 such that φ(u_v) = √v. Choosing u = u_v in the definition of ψ, we see that ψ(v) ≥ u_v v − φ(u_v) can be bounded from below as

ψ(v) ≥ (t_0 + 2)v − √v ≥ v ≥ √v,  ∀v ≥ max{1, (φ(t_0+2))^2}.   (2.6)

On the other hand, since φ is strictly decreasing on (−∞, 0], for v ≥ max{1, (φ(−2))^2} there exists some u_v ≤ −2 such that φ(u_v) = √v. It follows that

ψ(−v) ≥ −u_v v − φ(u_v) ≥ 2v − √v ≥ v ≥ √v,  ∀v ≥ max{1, (φ(−2))^2}.
This in connection with (2.6) implies that ψ(−v) > (m−1)φ(0) whenever

|v| > max{ (m−1)^2 (φ(0))^2, (φ(t_0+2))^2, 1, (φ(−2))^2 } =: C_2(φ, m).

Combining with (2.4), we see again that m|(α̂_σ)_i| = |m(α̂_σ)_i y_i| ≤ C_2(φ, m) for each i ∈ {1, ..., m}. This proves our claim in Case 2: ||α̂_σ||_{ℓ^∞(R^m)} ≤ C_2(φ, m).

Therefore, (2.3) holds with C(φ, m) = max{C_1(φ, m), C_2(φ, m)}.

Next, we apply our claim (2.3) to prove the proposition. Denote Ĝ(σ) = max_{α∈R^m} R̂(α, σ) = R̂(α̂_σ, σ). To prove the existence of a solution (α̂, σ̂) = (α̂_{σ̂}, σ̂) to the problem (2.2), it is sufficient to prove that the function Ĝ(σ) is continuous on the compact metric space (Σ, d_Σ). Let σ_1, σ_0 ∈ Σ. By the definition of Ĝ(σ) and R̂(α, σ), we have

Ĝ(σ_1) − Ĝ(σ_0) = R̂(α̂_{σ_1}, σ_1) − R̂(α̂_{σ_0}, σ_0) ≤ R̂(α̂_{σ_1}, σ_1) − R̂(α̂_{σ_1}, σ_0) = (1/(4λ)) Σ_{i,j=1}^m (α̂_{σ_1})_i ( K_{σ_0}(x_i, x_j) − K_{σ_1}(x_i, x_j) ) (α̂_{σ_1})_j.

By symmetry, there holds

Ĝ(σ_0) − Ĝ(σ_1) ≤ (1/(4λ)) Σ_{i,j=1}^m (α̂_{σ_0})_i ( K_{σ_1}(x_i, x_j) − K_{σ_0}(x_i, x_j) ) (α̂_{σ_0})_j.

By the continuity of K_σ(x_i, x_j) at σ_0 for each pair (i, j), we know that for any ε > 0, there exists some δ > 0 such that |K_{σ_1}(x_i, x_j) − K_{σ_0}(x_i, x_j)| ≤ 4λε/(m C(φ, m))^2 whenever d_Σ(σ_1, σ_0) < δ. It follows from (2.3) and the above two bounds that |Ĝ(σ_1) − Ĝ(σ_0)| ≤ ε. This shows the continuity of Ĝ at σ_0. Since σ_0 is an arbitrary point of Σ, Ĝ is continuous on Σ. Therefore, a minimizer of Ĝ on Σ exists: σ̂ = arg min_{σ∈Σ} Ĝ(σ). Thus,

min_{σ∈Σ} max_{α} R̂(α, σ) = min_{σ∈Σ} Ĝ(σ) = Ĝ(σ̂) = max_{α} R̂(α, σ̂).

Moreover, the maximizer of R̂(·, σ̂) always exists. This tells us that the optimum of R̂(α, σ) is achievable. By the relationship between the primal problem and its dual, we obtain the existence of a solution to the multi-kernel regularization scheme (1.5). This completes the proof of the proposition.

Example 1. Let Σ = [σ_1, σ_2] with 0 < σ_1 ≤ σ_2 < ∞ and let K_σ be the Gaussian kernel K_σ(x, y) = exp(−|x − y|^2/(2σ^2)) on a compact subset X of R^n. Then a solution to the optimization problem (1.6) exists.
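The two hypotheses of Theorem 1 used in Example 1 can be checked numerically for the Gaussian family: each K_σ is a Mercer kernel (its kernel matrices are symmetric positive semidefinite), and K_σ(x, y) varies continuously in σ. The sketch below (an illustration with made-up sample points, not from the paper) verifies both on a random finite point set.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 3))  # points in a compact subset of R^3

def K(sigma):
    """Gaussian kernel matrix K_sigma(x, y) = exp(-|x - y|^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Mercer property: each kernel matrix is symmetric positive semidefinite.
for s in (0.5, 1.0, 2.0):
    assert np.min(np.linalg.eigvalsh(K(s))) > -1e-10

# Continuity in the index sigma (hypothesis of Theorem 1): a small change
# of sigma changes every kernel value only slightly.
assert np.max(np.abs(K(1.0) - K(1.0 + 1e-6))) < 1e-5
```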
3. Error Analysis: A General Framework

In this section, we give a general framework for our error analysis, consisting of a comparison theorem, a projection operator and an error decomposition procedure. It provides bounds for the excess misclassification error in terms of a regularization error and a sample error, studied separately in the next two sections.

3.1. Comparison Theorems

As in the learning rate stated in Theorem 2, the error analysis aims at bounding the excess misclassification error R(sgn(f_z)) − R(f_c). But the algorithm is designed by minimizing a penalized empirical error E_z^φ associated with the loss function φ. Knowledge of regularization schemes or empirical risk minimization processes would only lead us to expect the convergence of E^φ(f_z) as m → ∞. So relations between the misclassification error and the generalization error become crucial. Some work has been done on this topic [4, 40, 2]. Here we only mention the comparison theorems which will be used in this paper.

Denote R̄ = R ∪ {±∞}. Define f_ρ^φ = arg min E^φ(f), with the minimum taken over all functions f : X → R̄. Note that f_ρ^φ always exists since φ is convex. It satisfies sgn(f_ρ^φ) = f_c, an admissibility condition for the loss function; see [29, 2].

The first comparison theorem is for the hinge loss φ_h(t) = (1−t)_+.

Proposition 2. Let φ = φ_h be the hinge loss. We have f_ρ^{φ_h} = f_c and, for every measurable function f : X → R,

R(sgn(f)) − R(f_c) ≤ E^{φ_h}(f) − E^{φ_h}(f_c).   (3.1)

The fact f_c = f_ρ^{φ_h} was proved in [36]. The relation (3.1) was proved in [40]. The following comparison theorem for general activating loss functions was given in [9]. Note that the convexity of φ implies φ''(0) ≥ 0.
Proposition 3. If an activating loss φ satisfies φ''(0) > 0, then there exists a constant c_φ > 0 such that for any measurable function f : X → R, there holds

R(sgn(f)) − R(f_c) ≤ c_φ √( E^φ(f) − E^φ(f_ρ^φ) ).

Tighter comparison bounds are possible under some noise conditions. We say that ρ has a Tsybakov noise exponent α ≥ 0 if for some c_α > 0 and every measurable f : X → Y,

ρ_X({x ∈ X : f(x) ≠ f_c(x)}) ≤ c_α ( R(f) − R(f_c) )^α.   (3.2)

All distributions satisfy (3.2) with α = 0 and c_α = 1. The following sharper comparison bound for α > 0 follows immediately from [4, Lemma 6] and Proposition 3.

Corollary 1. Let φ be a classifying loss satisfying φ''(0) > 0. If ρ satisfies the Tsybakov noise condition (3.2) for some α ∈ [0, 1] and c_α > 0, then

R(sgn(f)) − R(f_c) ≤ 2 c_φ c_α ( E^φ(f) − E^φ(f_ρ^φ) )^{1/(2−α)},  ∀f : X → R.

3.2. Projection Operator

By the comparison theorems, we only need to bound the excess generalization error E^φ(f_z) − E^φ(f_ρ^φ) in order to study the performance of the classifier sgn(f_z). But we can do better using the special feature of a classifying loss that it achieves a zero. A key technical tool here is a projection operator. To simplify the notation and statements, we restrict our discussion to normalized classifying loss functions.

First we show that the target function f_ρ^φ can be chosen to be bounded. For x ∈ X, set the univariate convex function

Q(t) = Q_x(t) := ∫_Y φ(yt) dρ(y|x),  t ∈ R.   (3.3)

Its one-sided derivatives exist, are nondecreasing and satisfy Q'_−(t) ≤ Q'_+(t) for every t ∈ R. Denote

f_ρ^−(x) = sup{ t ∈ R : Q'_−(t) < 0 },  f_ρ^+(x) = inf{ t ∈ R : Q'_+(t) > 0 }.
Theorem 3. Let φ be a normalized classifying loss. Then

(a) for each x ∈ X, the univariate function Q given by (3.3) is strictly decreasing on (−∞, f_ρ^−(x)], strictly increasing on [f_ρ^+(x), +∞), and constant on [f_ρ^−(x), f_ρ^+(x)];

(b) f_ρ^φ : X → R is a minimizer of the generalization error E^φ if and only if, for almost every x ∈ X with respect to ρ_X, f_ρ^φ(x) is a minimizer of Q_x, that is, there holds

f_ρ^−(x) ≤ f_ρ^φ(x) ≤ f_ρ^+(x);   (3.4)

(c) we may choose a minimizer f_ρ^φ of E^φ satisfying f_ρ^φ(x) ∈ [−1, 1] for each x ∈ X.

Proof. Let x ∈ X and consider the univariate continuous function Q given by (3.3). It is strictly decreasing on the interval (−∞, f_ρ^−(x)), since Q'_−(t) < 0 on this interval. In the same way, Q'_+(t) > 0 for t > f_ρ^+(x), so Q is strictly increasing on (f_ρ^+(x), +∞). For t ∈ (f_ρ^−(x), f_ρ^+(x)), we have 0 ≤ Q'_−(t) ≤ Q'_+(t) ≤ 0, hence Q is constant there, taking the minimal value of Q on R. This proves (a).

Since E^φ(f) = ∫_X Q_x(f(x)) dρ_X(x), statement (b) follows directly from (a).

By assumption, φ is convex and has minimal zero 1. This implies that φ is strictly decreasing on (−∞, 1] and nondecreasing on [1, +∞). So Q(t) ≥ Q(1) for t > 1 and Q(t) ≥ Q(−1) for t < −1. Hence a minimum of Q can always be achieved on [−1, 1], and we may choose f_ρ^φ such that f_ρ^φ(x) ∈ [−1, 1]. This proves statement (c).

In what follows we shall always choose f_ρ^φ with ||f_ρ^φ||_∞ ≤ 1 for normalized classifying loss functions. Then we can make full use of the projection operator introduced in [9].

Definition 5. The projection operator π is defined on the space of measurable functions f : X → R as

π(f)(x) = 1 if f(x) > 1;  π(f)(x) = −1 if f(x) < −1;  π(f)(x) = f(x) if −1 ≤ f(x) ≤ 1.   (3.5)

It is easy to see that π(f) and f induce the same classifier, i.e., sgn(π(f)) = sgn(f). Applying this fact to the comparison theorems, it suffices to bound the excess generalization error of π(f_z) instead of f_z. This leads to better estimates, as we will see later. The following property of the projection operator is immediate from the definition of φ.
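Numerically, the projection (3.5) is nothing more than clipping function values into [−1, 1]; the sketch below (an illustration, with a hypothetical helper name) makes that explicit.

```python
import numpy as np

def project(f_vals):
    """Projection operator pi of (3.5): clip values of f into [-1, 1].

    sgn(pi(f)) = sgn(f), so the induced classifier is unchanged, while for a
    normalized classifying loss phi(y * pi(f)(x)) <= phi(y * f(x)), since phi
    is decreasing on (-inf, 1] and nondecreasing on [1, +inf).
    """
    return np.clip(f_vals, -1.0, 1.0)
```

For example, with the hinge loss and y = 1, a value f(x) = −3 incurs loss 4, while the projected value −1 incurs loss 2, with the same predicted label.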
Proposition 4. If φ is a normalized classifying loss function, then there holds almost surely

φ(y π(f)(x)) ≤ φ(y f(x)).   (3.6)

Hence for any measurable function f, we have E^φ(π(f)) ≤ E^φ(f) and E_z^φ(π(f)) ≤ E_z^φ(f).

3.3. Error Decomposition

Now we can present the error decomposition which leads to bounds of the excess generalization error for π(f_z). Define

f_λ = arg min_{f∈H_Σ} { E^φ(f) + λ ||f||_Σ^2 }.

Proposition 5. Let φ be a normalized classifying loss and f_z be given by (1.6). Then

E^φ(π(f_z)) − E^φ(f_ρ^φ) + λ ||f_z||_Σ^2 ≤ D(λ) + S_{z,λ},   (3.7)

where D(λ) is the regularization error of the multi-kernel space H_Σ, defined [27] as

D(λ) = inf_{σ∈Σ} inf_{f∈H_σ} { E^φ(f) − E^φ(f_ρ^φ) + λ ||f||_σ^2 },   (3.8)

and

S_{z,λ} = { E^φ(π(f_z)) − E_z^φ(π(f_z)) } + { E_z^φ(f_λ) − E^φ(f_λ) }.   (3.9)

Proof. Write E^φ(π(f_z)) − E^φ(f_ρ^φ) + λ ||f_z||_Σ^2 as

{ E^φ(π(f_z)) − E_z^φ(π(f_z)) } + { (E_z^φ(π(f_z)) + λ ||f_z||_Σ^2) − (E_z^φ(f_λ) + λ ||f_λ||_Σ^2) } + { E_z^φ(f_λ) − E^φ(f_λ) } + { E^φ(f_λ) − E^φ(f_ρ^φ) + λ ||f_λ||_Σ^2 }.

By Proposition 4, E_z^φ(π(f_z)) ≤ E_z^φ(f_z). This in connection with the definition of f_z tells us that the second term is ≤ 0. Note that S_{z,λ} is just the sum of the first and third terms. By the definition of f_λ, the last term equals D(λ). This proves (3.7).

The regularization error term D(λ) in the error decomposition (3.7) is independent of the sample. It can be estimated by K-functionals, as discussed in Section 4.
The last term S_{z,λ} in (3.7) is called the sample error. Without the projection, it is well understood owing to the vast literature in learning theory. The projection operator enables us to improve the sample error estimates, as stated in Theorem 5 below.

The comparison theorems and the error decomposition switch the goal of the error analysis to the estimation of the regularization error and the sample error. For instance, to prove Theorem 2, we first apply Proposition 2 to π(f_z) and then Proposition 5; this tells us that R(sgn(f_z)) − R(f_c) is bounded by the sum of S_{z,λ} and D(λ).

4. Estimating Regularization Error and Approximation Error

In this section, we discuss the estimation of the regularization error. The convexity of φ implies that φ'(t) = φ'_+(t) = φ'_−(t) for almost every t ∈ R.

Theorem 4. Let φ be a normalized classifying loss. Then

E^φ(f) − E^φ(f_ρ^φ) ≤ ||φ'||_{L^∞[−||f||_∞, ||f||_∞]} ||f − f_ρ^φ||_{L^1_{ρ_X}}.

If, moreover, φ is C^1 and φ' is absolutely continuous on R, we have

E^φ(f) − E^φ(f_ρ^φ) ≤ ||φ''||_{L^∞[−||f||_∞−1, ||f||_∞+1]} ||f − f_ρ^φ||^2_{L^2_{ρ_X}}.

Proof. With the function Q = Q_x defined in (3.3), write E^φ(f) − E^φ(f_ρ^φ) as

E^φ(f) − E^φ(f_ρ^φ) = ∫_X { Q(f(x)) − Q(f_ρ^φ(x)) } dρ_X.

Since φ'_−(0) < 0 and φ(t) ≥ 0, we have φ(0) > 0 and φ'_±(t) < 0 for t < 0. Let P(t) = max{ |φ'_±(t)|, |φ'_±(−t)| } for t > 0. We only need to prove

Q(f(x)) − Q(f_ρ^φ(x)) ≤ P(|f(x)|) |f(x) − f_ρ^φ(x)|   (4.1)

for those x with Q(f(x)) − Q(f_ρ^φ(x)) > 0. According to Theorem 3, such a point x satisfies f(x) ∉ [f_ρ^−(x), f_ρ^+(x)].

If f(x) > f_ρ^+(x), then Q is strictly increasing on [f(x), +∞). Hence f(x) > f_ρ^φ(x). By Theorem 3, we have

Q(f(x)) − Q(f_ρ^φ(x)) ≤ Q'_−(f(x)) ( f(x) − f_ρ^φ(x) ).
Note that both φ'_− and φ'_+ are nondecreasing, and Q(t) = η(x)φ(t) + (1−η(x))φ(−t). Hence

Q'_−(f(x)) = η(x)φ'_−(f(x)) − (1−η(x))φ'_+(−f(x)) ≤ max{ |φ'_±(|f(x)|)|, |φ'_±(−|f(x)|)| },

no matter whether f(x) ≥ 0 or not. Thus, (4.1) holds true when f(x) > f_ρ^+(x). In the same way, if f(x) < f_ρ^−(x), then Q is strictly decreasing on (−∞, f(x)]. Hence f(x) < f_ρ^φ(x). Theorem 3 yields again

Q(f(x)) − Q(f_ρ^φ(x)) ≤ −Q'_+(f(x)) ( f_ρ^φ(x) − f(x) ).

Since −Q'_+(f(x)) = −η(x)φ'_+(f(x)) + (1−η(x))φ'_−(−f(x)) ≤ P(|f(x)|), we see that (4.1) also holds when f(x) < f_ρ^−(x). This proves the first statement.

If φ is C^1 and φ' is absolutely continuous on R, we know from Theorem 3 that Q'(f_ρ^φ(x)) = 0. Hence

Q(f(x)) − Q(f_ρ^φ(x)) = ∫_{f_ρ^φ(x)}^{f(x)} { Q'(u) − Q'(f_ρ^φ(x)) } du ≤ ||Q''||_{L^∞(I)} |f(x) − f_ρ^φ(x)|^2 / 2,

where I is the interval between f_ρ^φ(x) and f(x). Then the second statement follows.

In the above, L^q_{ρ_X} is the L^q space with norm ||f||_{L^q_{ρ_X}} = ( ∫_X |f(x)|^q dρ_X )^{1/q}. Thus we can use the rich knowledge from approximation theory to estimate the regularization error. See [9] for details on bounding the regularization error for the SVM q-norm soft margin classifiers by means of K-functionals in L^q_{ρ_X}.

One advantage of multi-kernel algorithms is the improvement of the regularization error compared with the one-kernel setting. For examples and discussion, see [44, 30, 26].

5. Sample Error Estimates and Learning Rates

We are now in a position to estimate the sample error and derive the learning rates. Throughout this section, we assume that the kernels are uniformly bounded in the sense that

κ := sup_{σ∈Σ} ||K_σ||_{C(X×X)} < ∞.   (4.2)
To state our result, we need to introduce several further concepts and notations. The quantity E^φ(π(f_z)) − E_z^φ(π(f_z)) in the sample error (3.9) needs to be estimated by some uniform law of large numbers. To this end, we need the capacity of the hypothesis space, which plays an essential role in sample error estimates. In this paper, we use covering numbers measured by empirical distances.

Definition 6. Let F be a set of functions on Z and z = (z_1, ..., z_m) ∈ Z^m. The metric d_{2,z} is defined on F by

d_{2,z}(f, g) = ( (1/m) Σ_{i=1}^m ( f(z_i) − g(z_i) )^2 )^{1/2}.

For every ε > 0, the covering number of F with respect to d_{2,z} is defined as

N_{2,z}(F, ε) = inf{ l ∈ N : ∃ {f_i}_{i=1}^l ⊂ F such that F = ∪_{i=1}^l { f ∈ F : d_{2,z}(f, f_i) ≤ ε } }.

The function sets in our situation are balls of the multi-kernel space of the form

B_R = { f ∈ H_Σ : ||f||_Σ ≤ R } = ∪_{σ∈Σ} { f ∈ H_σ : ||f||_σ ≤ R }.

We need the empirical covering number of B_1, defined as

N(ε) = sup_{m∈N} sup_{x∈X^m} N_{2,x}(B_1, ε).   (4.3)

For a function f : Z → R, denote Ef = ∫_Z f(z) dρ.

Theorem 5. Let φ be a normalized classifying loss. Assume the following conditions with exponents q > 0, τ ∈ [0, 1] and p ∈ (0, 2):

(1) an increment condition for φ with a constant c_q > 0:

φ(t) ≤ c_q |t|^q,  ∀|t| ≥ 1;   (4.4)

(2) a variance-expectation bound for the pair (φ, ρ) with the exponent τ and some c_τ > 0:

E{ ( φ(yf(x)) − φ(y f_ρ^φ(x)) )^2 } ≤ c_τ ( E^φ(f) − E^φ(f_ρ^φ) )^τ,  ∀ ||f||_∞ ≤ 1;   (4.5)

(3) a capacity condition for the function set B_1 with a constant c_p > 0:

log N(ε) ≤ c_p (1/ε)^p,  ∀ε > 0.   (4.6)
If D(λ) ≤ c_β λ^β for some 0 < β ≤ 1 and c_β > 0, then for any ε > 0 and 0 < δ < 1, there exists a constant c independent of m such that, with λ = λ(m) = (1/m)^γ, we have with confidence 1 − δ,

E^φ(π(f_z)) − E^φ(f_ρ^φ) ≤ c (1/m)^θ,   (4.7)

where

γ = min{ 2/(β(4−2τ+pτ) + p(1−β)), 2/(2β + q − βq) }   (4.8)

and

θ = min{ 2β/(β(4−2τ+pτ) + p(1−β)) − ε, 2β/(2β + q − βq) }.   (4.9)

The proof of Theorem 5 will be given at the end of this section. Before applying Theorem 5, let us remark on the assumptions.

The increment condition (4.4) is satisfied for all commonly used loss functions, including the hinge loss and the least square loss.

The variance-expectation condition (4.5) for the pair (φ, ρ) always holds for τ = 0 with c_τ = (max{φ(−1), φ(1)})^2. This can be seen from the fact that |φ(yf(x)) − φ(y f_ρ^φ(x))| ≤ max{φ(−1), φ(1)}. Larger exponents τ are possible when φ has high convexity (such as φ_ls in Theorem 7 below) or when the distribution ρ satisfies some conditions (such as the Tsybakov noise condition (1.13) in Theorem 6 below).

The capacity condition (4.6) always holds with p ≤ 2 if K_Σ contains only one kernel. Note that for any function set F ⊂ C(X), the empirical covering number N_{2,x}(F, ε) is bounded by N_∞(F, ε), the (uniform) covering number of F under the metric ||·||_∞, since d_{2,x}(f, g) ≤ ||f − g||_∞. So in the multi-kernel setting, the behavior of the covering number N(ε) can be estimated from the uniform smoothness of the kernels in Σ according to [43].

Example 2. If the set Σ of kernels on X ⊂ R^n satisfies (1.10) for some s > 0, then there is a constant c_s > 0 such that log N(ε) ≤ c_s (1/ε)^{2n/s} for any ε > 0.

The regularization error D(λ) decays to zero once H_Σ is dense in C(X). By the discussion in Section 4, a decay rate with exponent β can be estimated if some a priori knowledge of the distribution is available; see [9] for explicit examples.

Let us now show how to apply Theorem 5 to derive learning rates. Recall Proposition 3 and Corollary 1. A direct corollary of Theorem 5 is as follows.
Corollary 2. Under the assumptions of Theorem 5, if φ''(0) > 0, then for any ε > 0 and 0 < δ < 1, there is a constant c independent of m such that, with confidence 1 − δ,

R(sgn(f_z)) − R(f_c) ≤ c (1/m)^{θ/2},   (4.10)

where λ = (1/m)^γ and γ, θ are given by (4.8) and (4.9), respectively. If, in addition, ρ satisfies the noise condition (3.2) with 0 < α ≤ 1, the power θ/2 in (4.10) can be improved to θ/(2−α).

Next we consider two classical classification algorithms: SVM classification and the least square method.

5.1. Learning Rates for SVM Classification

For SVM classification with the hinge loss, we illustrate how noise conditions on the distribution ρ raise the variance-expectation exponent τ in (4.5) from 0 (for general distributions) to τ = ζ/(ζ+1) > 0.

Theorem 6. Let φ = φ_h and let the multi-kernels {K_σ : σ ∈ Σ} satisfy (4.6). Assume

inf_{σ∈Σ} inf_{f∈H_σ} { E^{φ_h}(f) − E^{φ_h}(f_c) + λ ||f||_σ^2 } ≤ c_β λ^β,  ∀λ > 0   (4.11)

with 0 < β ≤ 1 and c_β > 0, and that ρ satisfies the noise condition (1.13) with ζ ∈ [0, ∞] and c_ζ > 0. Choose λ = λ(m) = (1/m)^{min{ 2(ζ+1)/(2β(ζ+2)+p(ζ+1−β)), 2/(β+1) }}. For any ε > 0 and 0 < δ < 1, there exists a constant C_ε > 0 independent of m such that, with confidence 1 − δ,

R(sgn(f_z)) − R(f_c) ≤ C_ε (1/m)^θ,  θ = min{ 2β(ζ+1)/(2β(ζ+2)+p(ζ+1−β)) − ε, 2β/(1+β) }.

Proof. Observe that φ_h satisfies the increment condition (4.4) with q = 1 and c_q = 2. Because of the noise condition (1.13), we know from [30] and [38] that condition (4.5) is valid with the exponent τ = ζ/(ζ+1) and a constant c_τ depending only on c_ζ and ζ. Then the conclusion follows from Theorem 5 and Proposition 2.

Theorem 2 stated in the introduction is a special case of Theorem 6 for multi-kernels with a uniform bound in C^s.

Proof of Theorem 2. By Example 2, (4.6) holds with p = 2n/s. Since φ_h is Lipschitz, Theorem 4 yields E^{φ_h}(f) − E^{φ_h}(f_c) ≤ ||f − f_c||_{L^1_{ρ_X}}. Hence (1.11) implies (4.11). Take ζ = 0, since no assumption on the noise is made. Theorem 2 then follows from Theorem 6.
5.2. Learning Rates with the Least-square Loss

Consider the least-square loss φ_ls(t) = (1 − t)² investigated in [31]. We illustrate how high convexity of the loss function yields a large variance-expectation exponent τ in (4.5). Here φ_ls(yf(x)) = (1 − yf(x))² = (y − f(x))² since y² = 1 for y ∈ Y. So we know [35] that f_ρ^φ = f_ρ, and the high convexity of φ_ls ensures [12] that (4.5) holds true with τ = 1 and C_τ = 1. The increment condition (4.4) for φ_ls holds with q = 2. Moreover, E^{φ_ls}(f) − E^{φ_ls}(f_ρ) = ‖f − f_ρ‖²_{L²_{ρ_X}}. Putting all these into Proposition 3 and Corollary 2, we obtain the following learning rate.

Theorem 7. Consider (1.6) with φ = φ_ls and multi-kernels {K^σ : σ ∈ Σ} satisfying (4.6) with some p ∈ (0, 2). Assume that for some 0 < β ≤ 1 and c_β > 0,

inf_{σ ∈ Σ} inf_{f ∈ H_σ} { ‖f − f_ρ‖²_{L²_{ρ_X}} + λ‖f‖_σ² } ≤ c_β λ^β, ∀λ > 0. (4.12)

Then by choosing λ = λ(m) = (1/m)^{min{2/(2β+p), 1}}, for any ε > 0 and 0 < δ < 1, there exists a constant C_ε independent of m such that with confidence 1 − δ,

R(sgn(f_z)) − R(f_c) ≤ C_ε (1/m)^θ with θ = (1/2) min{ 2β/(2β+p) − ε, β }. (4.13)

If, moreover, ρ satisfies (3.2), then θ can be improved to θ = (1/(2−α)) min{2β/(2β+p) − ε, β}. In particular, when inf_{x∈X} |f_ρ(x)| > 0, (4.13) holds with θ = min{2β/(2β+p) − ε, β}.

The above learning rate is better than those in the literature, e.g. [13, 23, 6, 40]. When the kernels are C^∞ with (1.10) valid for any s > 0, we may take p in Theorem 7 to be arbitrarily small, and the power θ in (4.13) becomes min{1/2 − ε, β/2}.

Example 3. Let φ(t) = (1 − t)², Σ = [σ₁, σ₂] with 0 < σ₁ ≤ σ₂ < ∞, and let K^σ be the Gaussian kernel K^σ(x, y) = exp{−|x − y|²/(2σ²)} on X ⊂ IR^n. Assume (4.12). Let ε > 0 and λ = λ(m) = (1/m)^{min{1/β − ε, 1}}. Then with confidence 1 − δ, we have

R(sgn(f_z)) − R(f_c) ≤ c̃ (1/m)^{θ/2}, θ = min{1 − ε, β}.

If ρ satisfies the noise condition (3.2) with 0 < α ≤ 1, then the power θ/2 can be improved to θ/(2−α) = (1/(2−α)) min{1 − ε, β}. When inf_{x∈X} |f_ρ(x)| > 0, we can replace θ/2 by min{1 − ε, β}.
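To make the double minimization inf_{σ∈Σ} inf_{f∈H_σ} of Example 3 concrete, here is a minimal sketch (our own construction, not the paper's algorithm; data, grid, and names are hypothetical, and a finite grid stands in for the interval Σ = [σ₁, σ₂]). For each Gaussian width σ it solves the regularized least-squares problem in H_σ in closed form and keeps the width with the smallest regularized empirical objective:

```python
import numpy as np

def gram(x, sigma):
    # Gaussian kernel matrix K^sigma on a 1-D sample
    return np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

def multi_kernel_ls(x, y, sigmas, lam):
    m = len(x)
    best = None
    for s in sigmas:
        K = gram(x, s)
        # representer form f = sum_j alpha_j K^s(x_j, .): the objective
        # (1/m)||K a - y||^2 + lam a^T K a is minimized by (K + m lam I) a = y
        alpha = np.linalg.solve(K + m * lam * np.eye(m), y)
        f = K @ alpha
        obj = np.mean((f - y) ** 2) + lam * alpha @ K @ alpha  # ||f||_sigma^2 = a^T K a
        if best is None or obj < best[0]:
            best = (obj, s, alpha)
    return best

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 50)
y = np.sign(np.sin(3 * x))                      # labels in {-1, +1}
obj, sigma_star, alpha = multi_kernel_ls(x, y, sigmas=[0.05, 0.2, 1.0], lam=1e-3)
acc = np.mean(np.sign(gram(x, sigma_star) @ alpha) == y)
```

The classifier is then sgn(f) with f built from the selected width σ*; on this toy sample it recovers most labels.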
5.3. Proof of the Main Result

To end this section, we prove our main result, Theorem 5. To this end, we shall use the following concentration inequality.

Proposition 6. Let F be a set of measurable functions on Z, and let B, c > 0 and τ ∈ [0, 1] be constants such that each function f ∈ F satisfies ‖f‖_∞ ≤ B and IE(f²) ≤ c(IEf)^τ. If for some a > 0 and p ∈ (0, 2),

sup_{m ∈ IN} sup_{z ∈ Z^m} log N_{2,z}(F, ε) ≤ a ε^{−p}, ∀ε > 0, (4.14)

then there exists a constant c′_p depending only on p such that for any t > 0, with probability at least 1 − e^{−t}, there holds

IEf − (1/m) Σ_{i=1}^m f(z_i) ≤ (1/2) η^{1−τ} (IEf)^τ + c′_p η + 2 (ct/m)^{1/(2−τ)} + 18Bt/m, ∀f ∈ F,

where

η := max{ c^{(2−p)/(4−2τ+pτ)} (a/m)^{2/(4−2τ+pτ)}, B^{(2−p)/(2+p)} (a/m)^{2/(2+p)} }.

To prove Proposition 6, we need some preparations.

Definition 7. A function ψ : IR₊ → IR₊ is sub-root if it is non-negative, non-decreasing, and ψ(r)/√r is non-increasing. For a sub-root function ψ and any D > 0, the equation ψ(r) = r/D has a unique positive solution.

The following proposition is given in [3, Theorem 3].

Proposition 7. Let F be a class of measurable, square integrable functions such that IEf − f ≤ b for all f ∈ F. Let ψ be a sub-root function, D be some positive constant, and r* be the unique solution to ψ(r) = r/D. Assume that

IE[ max{ 0, sup_{f ∈ F, IEf² ≤ r} ( IEf − (1/m) Σ_{i=1}^m f(z_i) ) } ] ≤ ψ(r), ∀r ≥ r*.

Then for all t > 0 and all K > D/7, with probability at least 1 − e^{−t} there holds

IEf − (1/m) Σ_{i=1}^m f(z_i) ≤ IEf²/K + 50K r*/D² + (K + 9b) t/m, ∀f ∈ F.
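Definition 7 makes the equation ψ(r) = r/D well posed. The following sketch (our own illustration, with arbitrary made-up constants standing in for c_p, D, B, a, m) finds the unique positive solution by bisection for a sub-root function of the shape later used in (4.16), and checks it against a closed-form bound of the shape (4.17):

```python
# Hypothetical constants; psi below has the form 2 c_p max{ r^(1/2 - p/4) (a/m)^(1/2),
# B^((2-p)/(2+p)) (a/m)^(2/(2+p)) }, which is non-decreasing with psi(r)/sqrt(r)
# non-increasing, hence sub-root.
cp, D, B, a, m, p = 1.3, 2.0, 1.0, 0.5, 100, 1.0

def psi(r):
    return 2 * cp * max(r ** (0.5 - p / 4) * (a / m) ** 0.5,
                        B ** ((2 - p) / (2 + p)) * (a / m) ** (2 / (2 + p)))

lo, hi = 1e-12, 1e6
for _ in range(200):                 # bisection: psi(r) - r/D changes sign once
    mid = 0.5 * (lo + hi)
    if psi(mid) > mid / D:
        lo = mid
    else:
        hi = mid
r_star = hi
# closed-form bound of the shape (4.17)
bound = max((2 * cp * D) ** (4 / (2 + p)),
            2 * cp * D * B ** ((2 - p) / (2 + p))) * (a / m) ** (2 / (2 + p))
```

The bisection solution stays below the closed-form bound, as the fixed point of each branch of the max does.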
We need to find the sub-root function ψ in our setting. To this end, introduce the Rademacher variables ε_i, i = 1, …, m. Then

IE[ sup_{f ∈ F, IEf² ≤ r} ( IEf − (1/m) Σ_{i=1}^m f(z_i) ) ] ≤ 2 IE[ sup_{f ∈ F, IEf² ≤ r} (1/m) Σ_{i=1}^m ε_i f(z_i) ]. (4.15)

The right hand side is called the local Rademacher process. It can be bounded by using empirical covering numbers and the entropy integral; see [34]. The following result is a scaled version of Proposition 5.4 in [30], where the case B = 1 is given.

Proposition 8. Let F be a class of measurable functions from Z to [−B, B]. Assume (4.14) for some p ∈ (0, 2) and a > 0. Then there exists a constant c_p depending only on p such that

IE[ sup_{f ∈ F, IEf² ≤ r} (1/m) Σ_{i=1}^m ε_i f(z_i) ] ≤ c_p max{ r^{1/2 − p/4} (a/m)^{1/2}, B^{(2−p)/(2+p)} (a/m)^{2/(2+p)} }.

According to Proposition 8 and (4.15), in applying Proposition 7 one should take

ψ(r) = 2c_p max{ r^{1/2 − p/4} (a/m)^{1/2}, B^{(2−p)/(2+p)} (a/m)^{2/(2+p)} }. (4.16)

Then the solution r* to the equation ψ(r) = r/D satisfies

r* ≤ max{ (2c_p D)^{4/(2+p)}, 2c_p D B^{(2−p)/(2+p)} } (a/m)^{2/(2+p)}. (4.17)

Proof of Proposition 6. Let ψ be defined by (4.16) and r* be the solution to ψ(r) = r/D. Since ‖f‖_∞ ≤ B, we have IEf − f ≤ b := 2B for each f ∈ F. Choose K = D/5. By Proposition 7 and the condition IEf² ≤ c(IEf)^τ, we know that with probability at least 1 − e^{−t} there holds

IEf − (1/m) Σ_{i=1}^m f(z_i) ≤ (5c/D)(IEf)^τ + (10/D) r* + (D/5 + 18B) t/m, ∀f ∈ F. (4.18)

Recall that r* satisfies (4.17). Take D = 10c η^{τ−1}, where η is given in our statement. Then 5c/D = (1/2) η^{1−τ}. The expression of η in connection with the bound (4.17) for r* tells
us that (10/D) r* ≤ c″_p η, where c″_p is a constant depending only on p and c_p, hence only on p. Observe from the choice of D that

(D/5)(t/m) = 2ct η^{τ−1}/m ≤ 2 max{ η, (ct/m)^{1/(2−τ)} },

according to whether η ≥ (ct/m)^{1/(2−τ)} or not. Take c′_p to be the constant c″_p + 2, depending only on p. Then the desired inequality holds for each f ∈ F. This proves Proposition 6.

We now turn to our key analysis and prove Theorem 5. Let us first explain our main ideas. In the sample error term of (3.7), the quantity E_z^φ(f_λ) − E^φ(f_λ) is easy to handle. It can be estimated by the one-side Bernstein inequality for the single random variable φ(yf_λ(x)) on Z. This will be done in the first step of the proof with a mild technical modification: consider the random variable ξ = φ(yf_λ(x)) − φ(yf_ρ^φ(x)) instead of φ(yf_λ(x)).

The quantity E^φ(π(f_z)) − E_z^φ(π(f_z)) is more difficult, and we need Proposition 6 to estimate it. Here the function set will be F = { φ(yπ(f)(x)) − φ(yf_ρ^φ(x)) : f ∈ B_R } with such a radius R that B_R contains f_z, i.e., R is a bound for ‖f_z‖_Σ. On the other hand, a smaller radius R yields better estimates. Hence good bounds for ‖f_z‖_Σ play an important role in the sample error estimates.

A rough bound for ‖f_z‖_Σ immediately follows from the definition of f_z. By choosing f = 0, we find λ‖f_z‖²_Σ ≤ E_z^φ(f_z) + λ‖f_z‖²_Σ ≤ E_z^φ(0) + λ·0 = φ(0). This proves

Lemma 1. For every λ > 0, there holds ‖f_z‖_Σ ≤ √(φ(0)/λ).

We may use the bound √(φ(0)/λ) as R in F and apply Proposition 6 to get some rough estimates for E^φ(π(f_z)) − E_z^φ(π(f_z)). However, the empirical error E_z^φ(f) is a good approximation of the generalization error E^φ(f). Hence the penalty value ‖f_z‖_Σ is expected to be close to ‖f_λ‖_Σ, which is bounded by √(D(λ)/λ):

λ‖f_λ‖²_Σ ≤ E^φ(f_λ) − E^φ(f_ρ^φ) + λ‖f_λ‖²_Σ = D(λ). (4.19)

This expectation will be realized by an iteration technique used in [30] and [38]. By this technique, we shall show under some assumptions that with high confidence ‖f_z‖_Σ has a bound arbitrarily close to √(D(λ)/λ) (in the order of λ).
We are in a position to estimate the sample error and prove Theorem 5.

Proof of Theorem 5. Write the sample error as

S_{z,λ} = [ (E^φ(π(f_z)) − E^φ(f_ρ^φ)) − (E_z^φ(π(f_z)) − E_z^φ(f_ρ^φ)) ] + [ (E_z^φ(f_λ) − E_z^φ(f_ρ^φ)) − (E^φ(f_λ) − E^φ(f_ρ^φ)) ] := S_1 + S_2.

We divide our estimation into three steps. Take t ≥ 1, which will be determined later. Denote B = max{φ(−1), φ(1)}.

Step 1: estimate S_2. Consider the random variable ξ = φ(yf_λ(x)) − φ(yf_ρ^φ(x)) on Z. Denote

ξ = ξ_1 + ξ_2 = [ φ(yf_λ(x)) − φ(yπ(f_λ)(x)) ] + [ φ(yπ(f_λ)(x)) − φ(yf_ρ^φ(x)) ].

First we bound ξ_1. By (1.2), (4.2) and (4.19), we have ‖f_λ‖_∞ ≤ κ‖f_λ‖_Σ ≤ κ√(D(λ)/λ). We may assume the last quantity to be greater than one, since otherwise ξ_1 ≡ 0. Then the increment condition on φ tells us

0 ≤ ξ_1 ≤ B_λ := c_q κ^q (D(λ)/λ)^{q/2}.

Hence |ξ_1 − IE(ξ_1)| ≤ B_λ. Applying the one-side Bernstein inequality to ξ_1, we know that for any ε > 0,

Prob{ (1/m) Σ_{i=1}^m ξ_1(z_i) − IEξ_1 > ε } ≤ exp{ − mε² / (2(σ²(ξ_1) + B_λ ε/3)) }.

Solving the quadratic equation mε²/(2(σ²(ξ_1) + B_λ ε/3)) = t for ε, we see that there exists a subset U_1 of Z^m with measure at least 1 − e^{−t} such that for every z ∈ U_1,

(1/m) Σ_{i=1}^m ξ_1(z_i) − IEξ_1 ≤ B_λ t/(3m) + √( (B_λ t/(3m))² + 2σ²(ξ_1) t/m ) ≤ 2B_λ t/(3m) + √( 2t σ²(ξ_1)/m ).

But the fact 0 ≤ ξ_1 ≤ B_λ implies σ²(ξ_1) ≤ B_λ IE(ξ_1). Therefore, we have

(1/m) Σ_{i=1}^m ξ_1(z_i) − IEξ_1 ≤ 7B_λ t/(6m) + IEξ_1, ∀z ∈ U_1.
Next we consider ξ_2. Since both yπ(f_λ)(x) and yf_ρ^φ(x) lie in [−1, 1], ξ_2 is a random variable satisfying |ξ_2| ≤ B. Applying the one-side Bernstein inequality as above, we know that there exists another subset U_2 of Z^m with measure at least 1 − e^{−t} such that for every z ∈ U_2,

(1/m) Σ_{i=1}^m ξ_2(z_i) − IEξ_2 ≤ 2Bt/(3m) + √( 2t σ²(ξ_2)/m ).

By (4.5), we have σ²(ξ_2) ≤ C_τ (IEξ_2)^τ. Applying the elementary inequality

1/q̂ + 1/q̂′ = 1 with q̂, q̂′ > 1 ⟹ a·b ≤ (1/q̂) a^{q̂} + (1/q̂′) b^{q̂′}, a, b ≥ 0,

with q̂ = 2/(2−τ), q̂′ = 2/τ, a = √(2tC_τ/m) and b = (IEξ_2)^{τ/2}, we see that

√( 2t σ²(ξ_2)/m ) ≤ √(2tC_τ/m) (IEξ_2)^{τ/2} ≤ (1 − τ/2)(2tC_τ/m)^{1/(2−τ)} + (τ/2) IEξ_2.

Hence

(1/m) Σ_{i=1}^m ξ_2(z_i) − IEξ_2 ≤ 2Bt/(3m) + (2tC_τ/m)^{1/(2−τ)} + IEξ_2, ∀z ∈ U_2.

Combine the above estimates for ξ_1 and ξ_2 with the fact IEξ_1 + IEξ_2 = IEξ ≤ D(λ) ≤ c_β λ^β. We conclude that

S_2 ≤ (7B_λ + 4B) t/(6m) + (2tC_τ/m)^{1/(2−τ)} + D(λ), ∀z ∈ U_1 ∩ U_2. (4.20)

Step 2: estimate S_1. By Proposition 5, one has

Δ_z := E^φ(π(f_z)) − E^φ(f_ρ^φ) + λ‖f_z‖²_Σ ≤ S_1 + S_2 + D(λ). (4.21)

Let R > 0. Apply Proposition 6 to the function set

F = { φ(yπ(f)(x)) − φ(yf_ρ^φ(x)) : f ∈ B_R }.

Since

|φ(yπ(f)(x)) − φ(yπ(g)(x))| ≤ |φ′(−1)| |π(f)(x) − π(g)(x)| ≤ |φ′(−1)| |f(x) − g(x)|,

there holds N_{2,z}(F, ε) ≤ N_{2,z}(B_R, ε/|φ′(−1)|). Hence (4.6) yields (4.14) with a = c_p |φ′(−1)|^p R^p.
Since |φ(yπ(f)(x))| ≤ B and |φ(yf_ρ^φ(x))| ≤ B, we know that ‖f‖_∞ ≤ B for every f ∈ F. The assumption (4.5) tells us that IEf² ≤ c(IEf)^τ with c = C_τ. Thus all the conditions in Proposition 6 hold, and we know that there is a subset V(R) of Z^m with measure at least 1 − e^{−t} such that for every z ∈ V(R) and every f ∈ B_R,

( E^φ(π(f)) − E^φ(f_ρ^φ) ) − ( E_z^φ(π(f)) − E_z^φ(f_ρ^φ) ) ≤ (1/2) η_R^{1−τ} ( E^φ(π(f)) − E^φ(f_ρ^φ) )^τ + c′_p η_R + 2 (C_τ t/m)^{1/(2−τ)} + 18Bt/m, (4.22)

where η_R = η is given in Proposition 6 with c = C_τ and a = c_p |φ′(−1)|^p R^p, i.e.,

η_R = max{ C_τ^{(2−p)/(4−2τ+pτ)} ( c_p |φ′(−1)|^p R^p / m )^{2/(4−2τ+pτ)}, B^{(2−p)/(2+p)} ( c_p |φ′(−1)|^p R^p / m )^{2/(2+p)} }.

Let W(R) be the subset of Z^m defined by

W(R) = { z ∈ U_1 ∩ U_2 : f_z ∈ B_R }.

Let z ∈ W(R) ∩ V(R). Then (4.22) holds for f_z. Together with the estimate (4.20) for S_2 and (4.21), we know that

Δ_z ≤ (1/2) η_R^{1−τ} ( E^φ(π(f_z)) − E^φ(f_ρ^φ) )^τ + c′_p η_R + 4 (C_τ t/m)^{1/(2−τ)} + 19Bt/m + 3B_λ t/(2m) + 2D(λ).

When τ = 1 this yields

Δ_z ≤ c̃_p η_R + 8 (C_τ t/m)^{1/(2−τ)} + (38Bt + 3B_λ t)/m + 4D(λ), (4.23)

where c̃_p = max{2c′_p, 1}. Here we have bounded 2c′_p by c̃_p. When 0 < τ < 1, we use the elementary inequality: if a, b > 0 and 0 < τ < 1, then

x ≤ a x^τ + b, x > 0 ⟹ x ≤ max{ (2a)^{1/(1−τ)}, 2b }.

We find that (4.23) still holds. By the choice of λ = λ(m) = (1/m)^γ, one easily checks that

η_R ≤ c_{p,τ} λ^β max{ (R² λ^{1−β})^{p/(4−2τ+pτ)}, (R² λ^{1−β})^{p/(2+p)} }
for some c_{p,τ} > 0. But 4 − 2τ + pτ ≥ 2 + p, hence if R > λ^{(β−1)/2}, then

η_R ≤ c_{p,τ} λ^β (R² λ^{1−β})^{p/(2+p)} = c_{p,τ} λ^{(p+2β)/(2+p)} R^{2p/(2+p)}. (4.24)

The choice of λ together with the assumption D(λ) ≤ c_β λ^β on the regularization error and t ≥ 1 also implies

8 (C_τ t/m)^{1/(2−τ)} + (38Bt + 3B_λ t)/m + 4D(λ) ≤ c_{q,τ,β} t λ^β (4.25)

for some c_{q,τ,β} > 0. Putting the estimates (4.25) and (4.24) into (4.23), we obtain

Δ_z ≤ c̃_p c_{p,τ} λ^{(p+2β)/(2+p)} R^{2p/(2+p)} + c_{q,τ,β} t λ^β, ∀z ∈ W(R) ∩ V(R), (4.26)

whenever R > λ^{(β−1)/2}. This implies that ‖f_z‖_Σ ≤ √(Δ_z/λ) ≤ g(R), where g : IR₊ → IR₊ is a univariate function defined as

g(R) = √(c̃_p c_{p,τ}) λ^{(β−1)/(2+p)} R^{p/(2+p)} + √(c_{q,τ,β} t) λ^{(β−1)/2}. (4.27)

It follows that

W(R) ∩ V(R) ⊂ W(g(R)), ∀R > λ^{(β−1)/2}. (4.28)

Step 3: by iteration, find a small ball B_R that, with high confidence, contains f_z. Lemma 1 means that W(R₀) = U_1 ∩ U_2 for R₀ = √(φ(0)/λ). When R₀ > λ^{(β−1)/2}, we use our conclusion (4.28) iteratively. Denote g^[0](R) = R, g^[1](R) = g(R), and g^[l](R) = g(g^[l−1](R)) for l ≥ 2. According to (4.28), if

g^[j](R) > λ^{(β−1)/2}, j = 0, 1, …, l−1, (4.29)

then

W(R) ∩ V(R) ∩ V(g^[1](R)) ∩ ⋯ ∩ V(g^[l−1](R)) ⊂ W(g^[l](R)). (4.30)

Observe that g(R) = d₀ R^{p/(2+p)} + d₁ with d₀, d₁ > 0 given in (4.27). Then

g^[2](R) = d₀ ( d₀ R^{p/(2+p)} + d₁ )^{p/(2+p)} + d₁ ≤ d₀^{1 + p/(2+p)} R^{(p/(2+p))²} + d₁ + d₀ d₁^{p/(2+p)},
and in general, for l ∈ IN,

g^[l](R) ≤ d₀^{1 + p/(2+p) + ⋯ + (p/(2+p))^{l−1}} R^{(p/(2+p))^l} + d₀^{1 + p/(2+p) + ⋯ + (p/(2+p))^{l−2}} d₁^{(p/(2+p))^{l−1}} + ⋯ + d₀ d₁^{p/(2+p)} + d₁.

This in connection with the expressions for d₀ and d₁ gives

g^[l](R) ≤ c₀^{(2+p)/4} λ^{(β−1)(1 − (p/(2+p))^l)/2} R^{(p/(2+p))^l} + Σ_{i=0}^{l−1} c₀^{(2+p)/4} (c₁ t)^{(p/(2+p))^i} λ^{(β−1)/2},

where c₀ = max{1, c̃_p c_{p,τ}} and c₁ = max{1, c_{q,τ,β}}. In particular, for R = R₀, there holds

g^[l](R₀) ≤ c₀^{(2+p)/4} λ^{(β−1)/2} { (φ(0))^{(1/2)(p/(2+p))^l} λ^{−(β/2)(p/(2+p))^l} + c₁ t l }.

For ε > 0, choose l₀ ∈ IN such that l₀ ≥ log(1/(2ε)) / log((2+p)/p). Then (1/2)(p/(2+p))^{l₀} ≤ ε. It follows that

g^[l₀](R₀) ≤ c₀^{(2+p)/4} λ^{(β−1)/2} { (φ(0))^{(1/2)(p/(2+p))^{l₀}} λ^{−βε} + c₁ t l₀ }

when (4.29) with l = l₀ and R = R₀ holds.

When (4.29) with l = l₀ and R = R₀ is not valid, we have g^[j₀](R₀) ≤ λ^{(β−1)/2} for some j₀ ∈ {0, 1, …, l₀ − 1}. Take l_ε = l₀ when (4.29) with l = l₀ and R = R₀ holds, and l_ε = j₀ otherwise. In both cases, we have

g^[l_ε](R₀) ≤ c_ε λ^{(β−1)/2 − βε} =: R_ε, (4.31)

where c_ε := c₀^{(2+p)/4} ( (φ(0))^{(1/2)(p/(2+p))^{l₀}} + c₁ t l₀ ).

Take l = l_ε ≤ l₀ and R = R₀ in (4.30). Since W(R₀) = U_1 ∩ U_2, we know that there is a subset V_ε of Z^m with measure at most l₀ e^{−t} such that U_1 ∩ U_2 ⊂ W(R_ε) ∪ V_ε. Then the measure of the set W(R_ε) is at least 1 − (l₀ + 2) e^{−t}.
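The iteration of Step 3 is easy to visualize: g has the form g(R) = d₀ R^{p/(2+p)} + d₁, so iterating from a rough radius R₀ contracts rapidly toward a fixed-point scale, which is why a fixed number l₀ of steps suffices. A numerical sketch (our own, with hypothetical values for d₀, d₁, R₀):

```python
# Hypothetical stand-ins for the quantities of (4.27) and Lemma 1.
p = 1.0
d0, d1, R0 = 2.0, 0.5, 1000.0

def g(R):
    # one iteration step: g(R) = d0 * R^(p/(2+p)) + d1
    return d0 * R ** (p / (2 + p)) + d1

iterates = [R0]
for _ in range(6):
    iterates.append(g(iterates[-1]))
```

After a handful of steps the iterates are essentially stationary near the solution of R = d₀ R^{p/(2+p)} + d₁.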
Apply (4.23) with R = R_ε and notice (4.25). Let z ∈ W(R_ε) ∩ V(R_ε). We know that

Δ_z ≤ c̃_p η_{R_ε} + c_{q,τ,β} t λ^β.

It is easy to check that η_{R_ε} ≤ c_{p,τ} c_ε (1/m)^θ. Therefore, with the constant c̃ = c̃_p c_{p,τ} c_ε + c_{q,τ,β} t, there holds

E^φ(π(f_z)) − E^φ(f_ρ^φ) ≤ Δ_z ≤ c̃ (1/m)^θ.

Taking t = log((l₀ + 3)/δ), the measure of the set W(R_ε) ∩ V(R_ε) is at least 1 − δ. Then Theorem 5 is proved.

6. Extensions

A key point of our analysis is to find essential bounds for the penalty functional values of regularization schemes. This approach can be extended to regularization schemes with more general loss functions and general penalty functionals.

Let the hypothesis space H be a function set containing 0. It is assigned a functional Ω : H → IR₊ satisfying Ω(0) = 0. Beyond the multi-kernel space H_Σ, such a hypothesis space arises in the linear programming support vector machine classifier [38] in a one-kernel setting, with the penalty functional Ω(f) defined for f ∈ H = H_{K,z} = { Σ_{i=1}^m α_i y_i K_{x_i} : α_i ≥ 0 } as Ω(f) = Σ_{i=1}^m α_i.

Let Y be a subset of IR, and V : IR² → IR₊ be a general loss function. The general regularization scheme in H associated with V and the penalty functional Ω is defined for the sample z as

f_z^V = arg min_{f ∈ H} { (1/m) Σ_{i=1}^m V(y_i, f(x_i)) + λ Ω(f) }. (5.1)

All the results we obtained for the multi-kernel regularized classifiers (1.6) can be established for the more general scheme (5.1) under the assumption that the pair (V, ρ) is M-admissible: there is a constant M > 0 such that |y| ≤ M almost surely with respect to ρ, and for each y ∈ [−M, M], V(y, t) is a convex function of the variable t ∈ IR satisfying

V(y, t) ≥ V(y, M) for t > M, and V(y, t) ≥ V(y, −M) for t < −M. (5.2)
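The M-admissibility condition (5.2) is straightforward to check numerically for a given loss. A sketch (our own, not from the text) for the ε-insensitive regression loss V(y, t) = max{|y − t| − ε, 0} with M = 1:

```python
# Check (5.2) on a grid: for |y| <= M,
#   V(y, t) >= V(y, M)  for t > M,  and  V(y, t) >= V(y, -M)  for t < -M.
# Here V(y, t) = psi(y - t) with psi(u) = max(|u| - eps, 0), which is even,
# convex, increasing on [0, inf) and psi(0) = 0.
def V(y, t, eps=0.1):
    return max(abs(y - t) - eps, 0.0)

M = 1.0
ok = all(
    V(y, t) >= V(y, M) and V(y, -t) >= V(y, -M)
    for y in (-1.0, -0.3, 0.0, 0.7, 1.0)   # sample of [-M, M]
    for t in (1.01, 1.5, 2.0, 5.0)         # t > M; by symmetry -t < -M
)
```

Every regression loss of the form ψ(y − t) with ψ as above passes this check, consistent with the M-admissibility claim below.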
An important family of regularization schemes (5.1) are those for regression with a general loss function: take Y = IR and V(y, f(x)) = ψ(y − f(x)), where ψ : IR → IR₊ is even, convex, and increasing on [0, +∞) with ψ(0) = 0. If |y| ≤ M almost surely with respect to ρ, then (V, ρ) is M-admissible. Our approach can be used to analyze the convergence of ∫_Z V(y, f_z^V(x)) dρ to inf_{f ∈ H} ∫_Z V(y, f(x)) dρ.

Example 4. Let ε > 0. The ε-insensitive norm used for regression is the univariate loss function ψ defined [35] as ψ(t) = max{|t| − ε, 0}. It would be interesting to analyze the convergence of the scheme (5.1) as ε tends to zero.

For the classification algorithm (1.6), some of our error bounds can be extended to nonclassifying loss functions (such as the exponential loss), i.e., those activating loss functions whose infimum cannot be achieved. For this purpose, we need a more general projection operator.

Definition 8. For M > 0, the projection operator π_M at level M is defined on the space of measurable functions f : X → IR as

π_M(f)(x) = M if f(x) > M; π_M(f)(x) = −M if f(x) < −M; and π_M(f)(x) = f(x) if −M ≤ f(x) ≤ M.

Using this projection operator, we can obtain similar error decompositions by revising the regularization error and introducing a level M adapted to the behavior of the loss function (the convergence rate of φ(t) as t → ∞). Then some learning rates can be obtained, following our approach.

Acknowledgement. While the paper was being revised as requested, we learned that a kernel-searching method leading to the regularization scheme (1.6) was studied recently in [22]. The learnability of multi-kernel spaces associated with Gaussian kernels with flexible variances, i.e., Σ = (0, +∞) in Example 3, was also verified recently in [39]. We thank the referees for their careful reading and constructive suggestions, which helped us improve the paper.

References
[1] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc. 68 (1950).

[2] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, Convexity, classification, and risk bounds, preprint.

[3] G. Blanchard, O. Bousquet and P. Massart, Statistical performance of support vector machines, preprint.

[4] G. Blanchard, G. Lugosi and N. Vayatis, On the rate of convergence of regularized boosting classifiers, J. Mach. Learning Res. 4 (2003).

[5] B. E. Boser, I. Guyon, and V. Vapnik, A training algorithm for optimal margin classifiers, in Proceedings of the Fifth Annual Workshop on Computational Learning Theory 5 (1992), Pittsburgh, ACM.

[6] O. Bousquet and A. Elisseeff, Stability and generalization, J. Mach. Learning Res. 2 (2002).

[7] L. Breiman, Arcing classifiers (discussion paper), Ann. Stat. 26 (1998).

[8] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, Choosing multiple parameters for support vector machines, Mach. Learning 46 (2002).

[9] D. R. Chen, Q. Wu, Y. Ying and D. X. Zhou, Support vector machine soft margin classifiers: error analysis, J. Mach. Learning Res. 5 (2004).

[10] C. Cortes and V. Vapnik, Support-vector networks, Mach. Learning 20 (1995).

[11] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press.

[12] F. Cucker and S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc. 39 (2001).

[13] F. Cucker and S. Smale, Best choices for regularization parameters in learning theory: On the bias-variance problem, Found. Comput. Math. 2 (2002).

[14] F. Cucker and D. X. Zhou, Learning Theory: An Approximation Theory Viewpoint, monograph manuscript in preparation for Cambridge University Press.
More information13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices
CS71 Randoness & Coputation Spring 018 Instructor: Alistair Sinclair Lecture 13: February 7 Disclaier: These notes have not been subjected to the usual scrutiny accorded to foral publications. They ay
More informationOn Constant Power Water-filling
On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives
More informationTail estimates for norms of sums of log-concave random vectors
Tail estiates for nors of sus of log-concave rando vectors Rados law Adaczak Rafa l Lata la Alexander E. Litvak Alain Pajor Nicole Toczak-Jaegerann Abstract We establish new tail estiates for order statistics
More informationLower Bounds for Quantized Matrix Completion
Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &
More informationOPTIMIZATION in multi-agent networks has attracted
Distributed constrained optiization and consensus in uncertain networks via proxial iniization Kostas Margellos, Alessandro Falsone, Sione Garatti and Maria Prandini arxiv:603.039v3 [ath.oc] 3 May 07 Abstract
More informationThe Weierstrass Approximation Theorem
36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined
More informationOn the Use of A Priori Information for Sparse Signal Approximations
ITS TECHNICAL REPORT NO. 3/4 On the Use of A Priori Inforation for Sparse Signal Approxiations Oscar Divorra Escoda, Lorenzo Granai and Pierre Vandergheynst Signal Processing Institute ITS) Ecole Polytechnique
More informationNon-Parametric Non-Line-of-Sight Identification 1
Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,
More informationBlock designs and statistics
Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent
More informationKeywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution
Testing approxiate norality of an estiator using the estiated MSE and bias with an application to the shape paraeter of the generalized Pareto distribution J. Martin van Zyl Abstract In this work the norality
More informationLecture 21. Interior Point Methods Setup and Algorithm
Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and
More informationJournal of Mathematical Analysis and Applications
J Math Anal Appl 386 202 205 22 Contents lists available at ScienceDirect Journal of Matheatical Analysis and Applications wwwelsevierco/locate/jaa Sei-Supervised Learning with the help of Parzen Windows
More informationGrafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space
Journal of Machine Learning Research 3 (2003) 1333-1356 Subitted 5/02; Published 3/03 Grafting: Fast, Increental Feature Selection by Gradient Descent in Function Space Sion Perkins Space and Reote Sensing
More informationLecture 9: Multi Kernel SVM
Lecture 9: Multi Kernel SVM Stéphane Canu stephane.canu@litislab.eu Sao Paulo 204 April 6, 204 Roadap Tuning the kernel: MKL The ultiple kernel proble Sparse kernel achines for regression: SVR SipleMKL:
More informationComputable Shell Decomposition Bounds
Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung
More informationTail Estimation of the Spectral Density under Fixed-Domain Asymptotics
Tail Estiation of the Spectral Density under Fixed-Doain Asyptotics Wei-Ying Wu, Chae Young Li and Yiin Xiao Wei-Ying Wu, Departent of Statistics & Probability Michigan State University, East Lansing,
More informationQuantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search
Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths
More informationarxiv: v3 [cs.lg] 7 Jan 2016
Efficient and Parsionious Agnostic Active Learning Tzu-Kuo Huang Alekh Agarwal Daniel J. Hsu tkhuang@icrosoft.co alekha@icrosoft.co djhsu@cs.colubia.edu John Langford Robert E. Schapire jcl@icrosoft.co
More informationA Bernstein-Markov Theorem for Normed Spaces
A Bernstein-Markov Theore for Nored Spaces Lawrence A. Harris Departent of Matheatics, University of Kentucky Lexington, Kentucky 40506-0027 Abstract Let X and Y be real nored linear spaces and let φ :
More informationarxiv: v1 [cs.lg] 8 Jan 2019
Data Masking with Privacy Guarantees Anh T. Pha Oregon State University phatheanhbka@gail.co Shalini Ghosh Sasung Research shalini.ghosh@gail.co Vinod Yegneswaran SRI international vinod@csl.sri.co arxiv:90.085v
More informationMax-Product Shepard Approximation Operators
Max-Product Shepard Approxiation Operators Barnabás Bede 1, Hajie Nobuhara 2, János Fodor 3, Kaoru Hirota 2 1 Departent of Mechanical and Syste Engineering, Bánki Donát Faculty of Mechanical Engineering,
More informationL p moments of random vectors via majorizing measures
L p oents of rando vectors via ajorizing easures Olivier Guédon, Mark Rudelson Abstract For a rando vector X in R n, we obtain bounds on the size of a saple, for which the epirical p-th oents of linear
More informationResearch Article Robust ε-support Vector Regression
Matheatical Probles in Engineering, Article ID 373571, 5 pages http://dx.doi.org/10.1155/2014/373571 Research Article Robust ε-support Vector Regression Yuan Lv and Zhong Gan School of Mechanical Engineering,
More information1 Proof of learning bounds
COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a
More informationComputable Shell Decomposition Bounds
Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago
More informationA MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION
A eshsize boosting algorith in kernel density estiation A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION C.C. Ishiekwene, S.M. Ogbonwan and J.E. Osewenkhae Departent of Matheatics, University
More informationPattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition
Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition
More informationLearnability and Stability in the General Learning Setting
Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu
More informationarxiv: v1 [cs.ds] 3 Feb 2014
arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/
More informationTight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions
Tight Inforation-Theoretic Lower Bounds for Welfare Maxiization in Cobinatorial Auctions Vahab Mirrokni Jan Vondrák Theory Group, Microsoft Dept of Matheatics Research Princeton University Redond, WA 9805
More informationNyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison
yströ Method vs : A Theoretical and Epirical Coparison Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou Machine Learning Lab, GE Global Research, San Raon, CA 94583 Michigan State University,
More informationA Theoretical Analysis of a Warm Start Technique
A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful
More informationPolygonal Designs: Existence and Construction
Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G
More informationGeneralized eigenfunctions and a Borel Theorem on the Sierpinski Gasket.
Generalized eigenfunctions and a Borel Theore on the Sierpinski Gasket. Kasso A. Okoudjou, Luke G. Rogers, and Robert S. Strichartz May 26, 2006 1 Introduction There is a well developed theory (see [5,
More informationVC Dimension and Sauer s Lemma
CMSC 35900 (Spring 2008) Learning Theory Lecture: VC Diension and Sauer s Lea Instructors: Sha Kakade and Abuj Tewari Radeacher Averages and Growth Function Theore Let F be a class of ±-valued functions
More informationBipartite subgraphs and the smallest eigenvalue
Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.
More informationProbability Distributions
Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples
More informationConvex Programming for Scheduling Unrelated Parallel Machines
Convex Prograing for Scheduling Unrelated Parallel Machines Yossi Azar Air Epstein Abstract We consider the classical proble of scheduling parallel unrelated achines. Each job is to be processed by exactly
More informationA Theoretical Framework for Deep Transfer Learning
A Theoretical Fraewor for Deep Transfer Learning Toer Galanti The School of Coputer Science Tel Aviv University toer22g@gail.co Lior Wolf The School of Coputer Science Tel Aviv University wolf@cs.tau.ac.il
More information3.3 Variational Characterization of Singular Values
3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and
More informationFairness via priority scheduling
Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation
More informationConstrained Consensus and Optimization in Multi-Agent Networks arxiv: v2 [math.oc] 17 Dec 2008
LIDS Report 2779 1 Constrained Consensus and Optiization in Multi-Agent Networks arxiv:0802.3922v2 [ath.oc] 17 Dec 2008 Angelia Nedić, Asuan Ozdaglar, and Pablo A. Parrilo February 15, 2013 Abstract We
More informationTesting Properties of Collections of Distributions
Testing Properties of Collections of Distributions Reut Levi Dana Ron Ronitt Rubinfeld April 9, 0 Abstract We propose a fraework for studying property testing of collections of distributions, where the
More informationSupport recovery in compressed sensing: An estimation theoretic approach
Support recovery in copressed sensing: An estiation theoretic approach Ain Karbasi, Ali Horati, Soheil Mohajer, Martin Vetterli School of Coputer and Counication Sciences École Polytechnique Fédérale de
More information