Multi-kernel Regularized Classifiers

Multi-kernel Regularized Classifiers

Qiang Wu, Yiming Ying, and Ding-Xuan Zhou
Department of Mathematics, City University of Hong Kong, Kowloon, Hong Kong, CHINA

Abstract. A family of classification algorithms generated from Tikhonov regularization schemes is considered. They involve multi-kernel spaces and general convex loss functions. Our main purpose is to provide satisfactory estimates for the excess misclassification error of these multi-kernel regularized classifiers. The error analysis consists of two parts: regularization error and sample error. Allowing multi-kernels in the algorithm improves the regularization error and approximation error, which is one advantage of the multi-kernel setting. For a general loss function, we show how to bound the regularization error by the approximation in some weighted L^q spaces. For the sample error, we use a projection operator. The projection, in connection with the decay of the regularization error, enables us to improve convergence rates in the literature even for the one-kernel schemes and special loss functions: the least square loss and the hinge loss for support vector machine soft margin classifiers. Existence of the optimization problem for the regularization scheme associated with multi-kernels is verified when the kernel functions are continuous with respect to the index set. Gaussian kernels with flexible variances and probability distributions with some noise conditions are demonstrated to illustrate the general theory.

Keywords and Phrases: Classification algorithm, multi-kernel regularization scheme, convex loss function, misclassification error, regularization error and sample error.

Supported by the Research Grants Council of Hong Kong [Project No. CityU ]. Corresponding author: Ding-Xuan Zhou.

1. Introduction

We study binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and multi-kernel spaces. These algorithms produce binary classifiers f : X → {1, −1}, from a compact metric space X (called the input space) to the output space Y = {1, −1} (representing the two classes). Such a classifier f labels a class f(x) ∈ Y for each point x (when X ⊂ ℝ^n, x is a vector representing an event with each component corresponding to a specific measurement). The classifiers considered here have the form sgn(f), defined as sgn(f)(x) = 1 if f(x) ≥ 0 and sgn(f)(x) = −1 if f(x) < 0, induced by real-valued functions. These functions are solutions of some optimization problems associated with a sample z = {(x_i, y_i)}_{i=1}^m, independently drawn according to an (unknown) probability distribution ρ on Z = X × Y. The nature of such an optimization problem (called a Tikhonov regularization scheme) is determined by two objects: a loss function and a hypothesis space.

Definition 1. A function φ : ℝ → ℝ_+ is called an activating loss (function) for classification if it is convex, φ'(0) < 0, and inf_{t∈ℝ} φ(t) = 0.

Typical examples of activating losses include the hinge loss φ_h(t) = (1−t)_+ = max{1−t, 0} for SVM classification and the exponential loss φ_exp(t) = e^{−t} for boosting.

Let φ be an activating loss. For a real-valued function f, when sgn(f) is used for classification or prediction, the local error incurred for the event x and output y is measured by the value φ(yf(x)). The average of local errors is defined as

    E^φ(f) = ∫_Z φ(yf(x)) dρ,

called the error or generalization error.

The convexity and the condition φ'(0) < 0 tell us that φ(yf(x)) > φ(0) > 0 when yf(x) < 0, i.e., when sgn(f)(x) predicts the class label y incorrectly. So local errors can be small only if yf(x) > 0. Hence minimizing the generalization error is expected to lead to a function predicting the label satisfactorily. This gives the intuition that φ is admissible for classification problems, as verified by many examples in practice.

Since the generalization error involving the unknown distribution ρ is not computable,

its discretization is used instead which, computable in terms of the sample z, is defined as

    E^φ_z(f) = (1/m) Σ_{i=1}^m φ(y_i f(x_i))

and called the empirical error. Regularized learning schemes are implemented by minimizing a penalized version of the empirical error over a set of functions, called a hypothesis space H, equipped with a functional Ω : H → ℝ_+. The penalty functional Ω reflects constraints imposed on functions from the hypothesis space in various desirable forms.

Definition 2. Given a function φ : ℝ → ℝ_+ and a hypothesis space H together with a penalty functional Ω, the regularized classifier generated for a sample z ∈ Z^m is defined as sgn(f_z), where f_z is a minimizer of the Tikhonov regularization scheme

    f_z := arg min_{f∈H} { (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + λ Ω(f) }.   (1.1)

Here λ is a positive constant called the regularization parameter. It depends on m: λ = λ(m), and usually λ(m) → 0 as m becomes large.

Reproducing kernel Hilbert spaces are often used as the hypothesis space in (1.1). They play an important role in learning theory because of their reproducing property. Let K : X × X → ℝ be continuous, symmetric and positive semidefinite, i.e., for any finite set of distinct points {x_1, ..., x_l} ⊂ X, the matrix (K(x_i, x_j))_{i,j=1}^l is positive semidefinite. Such a function is called a Mercer kernel.

The Reproducing Kernel Hilbert Space (RKHS) H_K associated with the Mercer kernel K is defined (see [1]) to be the completion of the linear span of the set of functions {K_x = K(x, ·) : x ∈ X} with the inner product ⟨·, ·⟩_K given by ⟨K_x, K_y⟩_K = K(x, y). The reproducing property of H_K is

    ⟨K_x, f⟩_K = f(x), ∀ x ∈ X, f ∈ H_K.   (1.2)

The classical soft margin classifiers [35] correspond to the scheme (1.1) with H = H_K:

    f_z = arg min_{f∈H_K} { (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + λ ||f||_K^2 }.   (1.3)

In this paper we introduce a multi-kernel setting where H is the union of a set of reproducing kernel Hilbert spaces (and Ω(f) is the infimum norm square of f).
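The following short Python sketch is not part of the original paper; it is a minimal numerical illustration of the one-kernel scheme (1.3), using the least-square loss φ(t) = (1−t)², for which φ(y_i f(x_i)) = (y_i − f(x_i))² and the minimizer has the classical closed form given by the representer theorem. The Gaussian kernel width, regularization parameter and synthetic data are arbitrary choices.

```python
# Illustrative sketch (not from the paper): the one-kernel scheme (1.3) with the
# least-square loss, solved in closed form via the representer theorem.
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """Mercer kernel K_sigma(x, y) = exp(-|x - y|^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def regularized_least_squares(X, y, sigma, lam):
    """f_z = argmin (1/m) sum (y_i - f(x_i))^2 + lam ||f||_K^2 over H_K.

    By the representer theorem f_z = sum_j c_j K(x_j, .), and the coefficients
    solve the linear system (K + lam * m * I) c = y.
    """
    m = len(y)
    K = gaussian_kernel(X, X, sigma)
    c = np.linalg.solve(K + lam * m * np.eye(m), y)
    return c, lambda Xnew: gaussian_kernel(Xnew, X, sigma) @ c

# Tiny synthetic sample z = {(x_i, y_i)} with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 2))
y = np.sign(X[:, 0] + 0.3 * rng.standard_normal(40))
c, f_z = regularized_least_squares(X, y, sigma=0.5, lam=0.01)
print("training sign accuracy:", np.mean(np.sign(f_z(X)) == y))
```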

Definition 3. Let K_Σ = {K_σ : σ ∈ Σ} be a set of Mercer kernels on X. The multi-kernel space associated with K_Σ is defined to be the union H_Σ = ∪_{σ∈Σ} H_{K_σ}. For f ∈ H_Σ, we take

    ||f||_Σ = inf{ ||f||_{K_σ} : f ∈ H_{K_σ}, σ ∈ Σ }.   (1.4)

Taking H_Σ as the hypothesis space and Ω(f) = ||f||_Σ² in (1.1) leads to the following scheme in the multi-kernel space H_Σ:

    f_z = arg min_{f∈H_Σ} { (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + λ ||f||_Σ² }.   (1.5)

The corresponding multi-kernel regularized classifier is given by sgn(f_z).

Denote (H_{K_σ}, ||·||_{K_σ}) as (H_σ, ||·||_σ) for simplicity. The regularization scheme in the multi-kernel space H_Σ can be rewritten as a two-layer minimization problem:

    f_z = arg min_{σ∈Σ} min_{f∈H_σ} { (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + λ ||f||_σ² }.   (1.6)

It reduces to (1.3) when Σ contains only one element.

Our study of general multi-kernel schemes is motivated by recent work on learning algorithms with varying kernels. In [8] support vector machines with multiple parameters are investigated. In [19, 24] mixture density estimation is considered and Gaussian kernels with variance σ² flexible on an interval [σ_1², σ_2²] with 0 < σ_1 < σ_2 < +∞ are used for deriving bounds. Approximation properties of multi-kernel spaces are studied in [44]. Multi-task learning algorithms involve kernels from a convex hull of several Mercer kernels and spaces with changing norms, e.g. [16, 18].

The first natural concern about the optimization problem (1.5) or (1.6) is the existence of a minimizer. This is assured by the compactness of the index metric set Σ and the continuity of K_σ with respect to σ ∈ Σ, as stated in the next result, which follows from Proposition 1 given in Section 2.

Theorem 1. Let φ be an activating loss. If the index set Σ is a compact metric space, and for each pair (x, y) the function K_σ(x, y) is continuous with respect to σ ∈ Σ, then a solution f_z to the multi-kernel scheme (1.6) exists.
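To make the two-layer formulation (1.6) concrete, the following sketch (again not part of the paper) approximates the outer minimization over σ by a finite grid on [σ_1, σ_2] and solves the inner problem in closed form for the least-square loss; the grid, data and parameters are arbitrary.

```python
# A minimal sketch of the two-layer problem (1.6): inner minimization in H_sigma in
# closed form (least-square loss), outer minimization over a grid of Gaussian widths.
import numpy as np

def solve_multi_kernel_ls(X, y, sigmas, lam):
    """Approximate f_z of (1.6) with phi = least-square loss and Gaussian kernels."""
    m = len(y)
    best = None
    for sigma in sigmas:                                   # outer layer: search over Sigma
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        K = np.exp(-d2 / (2.0 * sigma ** 2))
        c = np.linalg.solve(K + lam * m * np.eye(m), y)    # inner layer: minimizer in H_sigma
        emp_err = np.mean((K @ c - y) ** 2)                # (1/m) sum phi(y_i f(x_i))
        penalty = lam * float(c @ K @ c)                   # lam * ||f||_sigma^2
        objective = emp_err + penalty
        if best is None or objective < best[0]:
            best = (objective, sigma, c)
    return best  # (objective value, selected sigma, coefficients)

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 2))
y = np.sign(X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(50))
obj, sigma_hat, c_hat = solve_multi_kernel_ls(X, y, sigmas=np.linspace(0.2, 2.0, 10), lam=0.01)
print("selected sigma:", sigma_hat, "objective:", obj)
```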

In particular, f_z exists in the one-kernel setting (1.3). We shall assume the existence of a solution of the optimization problem (1.6) throughout the error analysis of multi-kernel regularized classifiers, the main goal of this paper.

Let (X, Y) be the random variable on X × Y with the probability distribution ρ. The misclassification error for a classifier f : X → Y is defined to be the probability of the event {f(X) ≠ Y}:

    R(f) = Prob{f(X) ≠ Y} = ∫_X P(Y ≠ f(x) | x) dρ_X.   (1.7)

Here ρ_X is the marginal distribution on X and P(· | x) is the conditional distribution. Our target of error analysis is to understand how sgn(f_z) approximates the Bayes rule, the best classifier with respect to the misclassification error: f_c = arg inf R(f) with the infimum taken over all classifiers. Denote η(x) = P(Y = 1 | x) and recall the regression function

    f_ρ(x) = ∫_Y y dρ(y|x) = P(Y = 1 | x) − P(Y = −1 | x) = 2η(x) − 1, x ∈ X.   (1.8)

Then the Bayes rule is given (e.g. [15]) by the sign of the regression function: f_c = sgn(f_ρ).

Estimating the excess misclassification error

    R(sgn(f_z)) − R(f_c)   (1.9)

for the multi-kernel regularized classification algorithm (1.6) is our main purpose.

For the one-kernel setting (1.3) and special choices of φ, the error analysis has been extensively investigated in the literature, especially when ρ is strictly separable (with a positive margin). Besides the hinge loss φ_h corresponding to the SVM 1-norm soft margin classifier [35, 25, 28, 11, 37], examples of loss functions include (1) φ_q(t) = (1−t)_+^q for the SVM q-norm (q > 1) soft margin classifier, see [35, 20, 9]; (2) the least square loss φ_ls(t) = (1−t)², see e.g. [12, 15, 17, 23, 29, 31, 40]; (3) the exponential loss φ_exp(t) = e^{−t}, see [40, 4]; (4) the logistic regression losses φ(t) = log(1 + e^{−t}) or 1/(1 + e^{−t}), see [40, 4].

For the error bounds, we will focus on activating loss functions achieving zeros, which allows us to provide a powerful analysis.

Definition 4. An activating loss is called a classifying loss if the infimum 0 can be achieved. It is called normalized if 1 is its minimal zero.

Examples of classifying losses include the hinge loss φ_h, the q-norm loss φ_q for SVM classification, and the least square loss φ_ls(t) = (1−t)². They are all normalized.

Our error analysis will be done in Sections 3-5. It uses an error decomposition procedure for regularization schemes introduced in [9, 38], with the aid of an iteration technique [30, 38] and a projection operator [9]. The convergence rates will be stated in terms of the sample size m with proper choices of the regularization parameter λ = λ(m) → 0.

Our analysis is powerful. It yields fast convergence rates. Let us demonstrate this by the SVM. Assume X ⊂ ℝ^n and, for some s > n, the multi-kernels K_Σ satisfy

    sup_{σ∈Σ} ||K_σ||_{C^s(X×X)} < ∞.   (1.10)

It means that {K_σ : σ ∈ Σ} is a set of C^s Mercer kernels with a uniform bound. The convergence rate for the SVM with such multi-kernels can be stated as follows.

Theorem 2. Let φ = φ_h and f_z be given by (1.6). Assume for some 0 < β ≤ 1 and c_β > 0,

    inf_{σ∈Σ} inf_{f∈H_σ} { ||f − f_c||_{L¹_{ρ_X}} + λ ||f||_σ² } ≤ c_β λ^β, ∀ λ > 0.   (1.11)

If (1.10) holds for some s > n, choose λ(m) = (1/m)^{min{1/(2β+(1−β)n/s), 2/(1+β)}}. For any ε > 0 and 0 < δ < 1, there exists a constant c independent of m such that with confidence 1 − δ,

    R(sgn(f_z)) − R(f_c) ≤ c (1/m)^θ,   (1.12)

where θ = min{ β/(2β+(1−β)n/s) − ε, 2β/(1+β) }.

In Theorem 2, the condition (1.11) measures the approximation power of the multi-kernel space H_Σ in L¹_{ρ_X}, acting on the function f_c. It can be described by some interpolation spaces of the pair (H_Σ, L¹_{ρ_X}). So only the sign of f_ρ is involved in (1.11). If further information about the distribution ρ is available, one expects sharper error estimates. For example, when ρ satisfies a so-called Tsybakov noise condition

    ρ_X({x ∈ X : 0 < |f_ρ(x)| ≤ t}) ≤ c_ζ t^ζ, ∀ t > 0,   (1.13)

with some ζ ∈ [0, ∞] and c_ζ > 0, then the power θ in the error bound (1.12) can be improved to θ = min{ β(ζ+1)/(β(ζ+2)+(ζ+1−β)n/s) − ε, 2β/(1+β) }. This will be shown in Theorem 6 below (in Section 5). Note that any distribution satisfies (1.13) with ζ = 0. The case ζ = ∞ is the same as |f_ρ(x)| being bounded away from zero or f_ρ(x) = 0, meaning that the two classes are well separated.

Our result is completely new for the multi-kernel setting. Even for the one-kernel setting H_Σ = H_K, Theorem 2 provides the best convergence rate for the SVM under the same assumption (1.11) on the approximation power of H_K and the regularity condition of the kernel (K ∈ C^s with s > n): the capacity independent estimates derived by Zhang [40] yield the learning rate (1.12) with θ = β/(1+β); under the noise condition (1.13), Steinwart and Scovel [30] obtained the learning rate (1.12) with θ = 2β(ζ+1)/((2+ζ+ζn/s)(1+β)) − ε. Since s > n, our rate is sharper than theirs.

2. Optimization Problem for Regularization with Multi-kernels

We divide the study of the optimization problem (1.6) into two steps. First, fix σ ∈ Σ. Denote the optimal solution in the RKHS H_σ as

    f_{z,σ} = arg inf_{f∈H_σ} { (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + λ ||f||_σ² }.

Define the dual function ψ : ℝ → ℝ of φ by

    ψ(v) = sup_{u∈ℝ} { vu − φ(u) }, v ∈ ℝ.   (2.1)

By the reproducing property (1.2), the optimization problem for solving f_{z,σ} on H_σ can be reduced to one on ℝ^m. The following relation between the primal problem and its dual is well known (see e.g. [41]):

    inf_{f∈H_σ} { (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + λ ||f||_σ² } = sup_{α∈ℝ^m} R̂(α, σ),

where

    R̂(α, σ) := −(1/m) Σ_{i=1}^m ψ(−α_i y_i) − (1/(4λm²)) Σ_{i,j=1}^m α_i K_σ(x_i, x_j) α_j, α ∈ ℝ^m.

Moreover, both optimizers exist. If α̂_σ = arg max_{α∈ℝ^m} R̂(α, σ), then sgn((α̂_σ)_i) = y_i and

    f_{z,σ}(x) = (1/(2λm)) Σ_{i=1}^m (α̂_σ)_i K_σ(x_i, x).

Next, consider the multi-kernel scheme (1.6). A solution f_z can be represented as

    f_z(x) = (1/(2λm)) Σ_{i=1}^m α̂_i K_σ̂(x_i, x)

if an optimal point (α̂, σ̂) of the following dual problem exists:

    (α̂, σ̂) = arg min_{σ∈Σ} max_{α∈ℝ^m} R̂(α, σ).   (2.2)

We show that under some mild conditions, (2.2) can be solved.

Proposition 1. Under the conditions of Theorem 1, an optimal point (α̂, σ̂) of (2.2) can be achieved. Hence an optimal solution f_z to the multi-kernel regularization scheme (1.5) always exists.

Proof. We first claim that there exists a constant C(φ, m) depending on φ and the sample size m such that

    ||α̂_σ||_{ℓ^∞(ℝ^m)} ≤ C(φ, m), ∀ σ ∈ Σ.   (2.3)

To verify our claim, recall that α̂_σ is a maximizer of R̂(·, σ). This yields

    R̂(α̂_σ, σ) ≥ R̂(0, σ) = −ψ(0) = −sup_{u∈ℝ} {−φ(u)} = inf_{u∈ℝ} φ(u) = 0.

But K_σ is positive semidefinite, so it follows that

    −(1/m) Σ_{i=1}^m ψ(−(α̂_σ)_i y_i) = R̂(α̂_σ, σ) + (1/(4λm²)) Σ_{i,j=1}^m (α̂_σ)_i K_σ(x_i, x_j)(α̂_σ)_j ≥ R̂(α̂_σ, σ) ≥ 0.

However, for each v ∈ ℝ,

    ψ(−v) = sup_{u∈ℝ} {−uv − φ(u)} ≥ −φ(0).

Therefore, for each i ∈ {1, ..., m}, we have

    ψ(−(α̂_σ)_i y_i) ≤ −Σ_{j≠i} ψ(−(α̂_σ)_j y_j) ≤ Σ_{j≠i} φ(0) = (m−1)φ(0).   (2.4)

Now we prove our claim in two cases. Recall that the convexity of φ implies that the one-side derivatives φ'_+ and φ'_− exist, are nondecreasing, and satisfy φ'_−(t) ≤ φ'_+(t) for any t ∈ ℝ.

Case 1: φ'_+(t) ≤ 0 for each t ∈ ℝ. In this case, φ is nonincreasing and lim_{u→+∞} φ(u) = inf_{u∈ℝ} φ(u) = 0. This in connection with the definition of the dual function implies

    ψ(−v) = sup_{u∈ℝ} {−uv − φ(u)} ≥ lim_{u→+∞} {−uv − φ(u)} = +∞, ∀ v < 0.   (2.5)

It follows from (2.4) that (α̂_σ)_i y_i ≥ 0 for each i. Definition 1 also tells us that φ is strictly decreasing on (−∞, 0] and lim_{t→−∞} φ(t) = +∞. Then the inverse function φ⁻¹ is well defined on [φ(0), +∞). Choosing u = φ⁻¹(√v) for v ≥ (φ(0))² in the definition of ψ, we see that ψ(−v) ≥ −v φ⁻¹(√v) − φ(φ⁻¹(√v)). It follows that for any v ≥ max{1, (φ(−2))²} there holds ψ(−v) ≥ √v(−√v φ⁻¹(√v) − 1) ≥ √v. Hence v ≤ max{1, (φ(−2))², (ψ(−v))²} for every v ≥ 0. Combining with (2.4), this implies that

    (α̂_σ)_i y_i ≤ max{1, (φ(−2))², (m−1)²(φ(0))²} =: C_1(φ, m).

As y_i = ±1 and sgn((α̂_σ)_i) = y_i, we know that |(α̂_σ)_i| = |(α̂_σ)_i y_i| = (α̂_σ)_i y_i ≤ C_1(φ, m) for each i. This proves our claim in Case 1: ||α̂_σ||_{ℓ^∞(ℝ^m)} ≤ C_1(φ, m).

Case 2: φ'_+(t_0) > 0 for some t_0 ∈ ℝ. In this case, t_0 > 0 and φ is strictly increasing on [t_0, +∞). Then for −v ≥ max{1, (φ(t_0+2))²}, there exists some u_v ≥ t_0 + 2 such that φ(u_v) = √(−v). Choosing u = u_v in the definition of ψ, we see that ψ(−v) ≥ −u_v v − φ(u_v) can be bounded from below as

    ψ(−v) ≥ (t_0+2)(−v) − √(−v) ≥ −v, whenever −v ≥ max{1, (φ(t_0+2))²}.   (2.6)

On the other hand, since φ is strictly decreasing on (−∞, 0], for v ≥ max{1, (φ(−2))²} there exists some u_v ≤ −2 such that φ(u_v) = √v. It follows that

    ψ(−v) ≥ −u_v v − φ(u_v) ≥ 2v − √v ≥ v, ∀ v ≥ max{1, (φ(−2))²}.

This in connection with (2.6) implies that ψ(−v) > (m−1)φ(0) whenever

    |v| > max{ (m−1)²(φ(0))², (φ(t_0+2))², 1, (φ(−2))² } =: C_2(φ, m).

Combining with (2.4), we see again that |(α̂_σ)_i| = |(α̂_σ)_i y_i| ≤ C_2(φ, m) for each i ∈ {1, ..., m}. This proves our claim in Case 2: ||α̂_σ||_{ℓ^∞(ℝ^m)} ≤ C_2(φ, m).

Therefore, (2.3) holds with C(φ, m) = max{C_1(φ, m), C_2(φ, m)}.

Next, we apply our claim (2.3) to prove the proposition. Denote Ĝ(σ) = max_{α∈ℝ^m} R̂(α, σ) = R̂(α̂_σ, σ). To prove the existence of a solution (α̂, σ̂) = (α̂_σ̂, σ̂) to the problem (2.2), it is sufficient to prove that the function Ĝ(σ) is continuous on the compact metric space (Σ, d_Σ). Let σ_1, σ_0 ∈ Σ. By the definitions of Ĝ(σ) and R̂(α, σ), we have

    Ĝ(σ_1) − Ĝ(σ_0) = R̂(α̂_{σ_1}, σ_1) − R̂(α̂_{σ_0}, σ_0) ≤ R̂(α̂_{σ_1}, σ_1) − R̂(α̂_{σ_1}, σ_0) = (1/(4λm²)) Σ_{i,j=1}^m (α̂_{σ_1})_i ( K_{σ_0}(x_i, x_j) − K_{σ_1}(x_i, x_j) ) (α̂_{σ_1})_j.

By symmetry, there holds

    Ĝ(σ_0) − Ĝ(σ_1) ≤ (1/(4λm²)) Σ_{i,j=1}^m (α̂_{σ_0})_i ( K_{σ_1}(x_i, x_j) − K_{σ_0}(x_i, x_j) ) (α̂_{σ_0})_j.

By the continuity of K_σ(x_i, x_j) at σ_0 for each pair (i, j), we know that for any ε > 0 there exists some δ > 0 such that |K_{σ_1}(x_i, x_j) − K_{σ_0}(x_i, x_j)| ≤ 4λε/(C(φ, m))² whenever d_Σ(σ_1, σ_0) < δ. It follows from (2.3) and the above two bounds that |Ĝ(σ_1) − Ĝ(σ_0)| ≤ ε. This shows the continuity of Ĝ at σ_0. Since σ_0 is an arbitrary point in Σ, Ĝ(σ) is continuous on Σ. Therefore, a minimizer of Ĝ(σ) in Σ exists: σ̂ = arg inf_{σ∈Σ} Ĝ(σ). Thus,

    inf_{σ∈Σ} max_{α} R̂(α, σ) = inf_{σ∈Σ} Ĝ(σ) = Ĝ(σ̂) = max_{α} R̂(α, σ̂).

Moreover, the maximizer of R̂(α, σ̂) always exists. This tells us that the general optimum of R̂(α, σ) is achievable. By the relationship between the primal problem and its dual, we obtain the existence of a minimizer for the multi-kernel regularization scheme (1.5). This completes the proof of the proposition.

Example 1. Let Σ = [σ_1, σ_2] with 0 < σ_1 ≤ σ_2 < ∞ and let K_σ be the Gaussian kernel K_σ(x, y) = exp(−|x−y|²/(2σ²)) on a compact subset X of ℝ^n. Then a solution to the optimization problem (1.6) exists.
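The short sketch below is not from the paper; it computes the dual function ψ of (2.1) by brute force on a finite grid and evaluates the dual objective R̂(α, σ), with the normalization used in this section, for the Gaussian kernels of Example 1. The grid ranges, sample data and the dual point α are arbitrary illustrative choices.

```python
# Numerical illustration of (2.1) and R_hat(alpha, sigma); approximations only.
import numpy as np

def dual_function(phi, v, u_grid=np.linspace(-50.0, 50.0, 20001)):
    """psi(v) = sup_u { v*u - phi(u) }, approximated by a maximum over a finite grid."""
    return np.max(v * u_grid - phi(u_grid))

def R_hat(alpha, K, y, lam, phi):
    """R_hat(alpha, sigma) = -(1/m) sum_i psi(-alpha_i y_i) - (1/(4 lam m^2)) alpha' K alpha."""
    m = len(y)
    psi_terms = np.array([dual_function(phi, -a * yi) for a, yi in zip(alpha, y)])
    return -psi_terms.mean() - (alpha @ K @ alpha) / (4.0 * lam * m ** 2)

hinge = lambda t: np.maximum(1.0 - t, 0.0)          # phi_h, an activating loss
print("psi_h(-0.5) ~", dual_function(hinge, -0.5))  # for the hinge loss, psi is finite only on [-1, 0]

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(20, 2))
y = np.sign(X[:, 0] + 0.2 * rng.standard_normal(20))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-d2 / (2 * 0.7 ** 2))                     # K_sigma from Example 1, sigma = 0.7
alpha = rng.uniform(0, 1, size=20) * y               # an illustrative dual point with sgn(alpha_i) = y_i
print("R_hat at this alpha:", R_hat(alpha, K, y, lam=0.1, phi=hinge))
```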

3. Error Analysis: A General Framework

In this section, we give a general framework for our error analysis, consisting of a comparison theorem, a projection operator and an error decomposition procedure. It provides bounds for the excess misclassification error in terms of a regularization error and a sample error, studied in the next two sections separately.

3.1. Comparison Theorems

Similar to the learning rate stated in Theorem 2, the error analysis aims at bounding the excess misclassification error R(sgn(f_z)) − R(f_c). But the algorithm is designed by minimizing a penalized empirical error E^φ_z associated with the loss function φ. Knowledge of regularization schemes or empirical risk minimization processes would only lead us to expect the convergence of E^φ(f_z) as m → ∞. So relations between the misclassification error and the generalization error become crucial. Some work has been done on this topic [4, 40, 2]. Here we only mention some comparison theorems which will be used in the paper.

Denote ℝ̄ = ℝ ∪ {±∞}. Define f^φ_ρ = arg min E^φ(f), with the minimum taken over all functions f : X → ℝ̄. Note that f^φ_ρ always exists since φ is convex. It satisfies sgn(f^φ_ρ) = f_c, an admissible condition for the loss function, see [29, 2].

The first comparison theorem is for the hinge loss φ_h(t) = (1−t)_+.

Proposition 2. Let φ = φ_h be the hinge loss. We have f^{φ_h}_ρ = f_c and for every measurable function f : X → ℝ̄,

    R(sgn(f)) − R(f_c) ≤ E^{φ_h}(f) − E^{φ_h}(f_c).   (3.1)

The fact f_c = f^{φ_h}_ρ was proved in [36]. The relation (3.1) was proved in [40]. The following comparison theorem for general activating loss functions was given in [9]. Note that the convexity of φ implies φ''(0) ≥ 0.

Proposition 3. If an activating loss φ satisfies φ''(0) > 0, then there exists a constant c_φ > 0 such that for any measurable function f : X → ℝ̄, there holds

    R(sgn(f)) − R(f_c) ≤ c_φ √( E^φ(f) − E^φ(f^φ_ρ) ).

Tighter comparison bounds are possible under some noise conditions. We say that ρ has a Tsybakov noise exponent α ≥ 0 if for some c_α > 0 and every measurable function f : X → Y,

    ρ_X({x ∈ X : f(x) ≠ f_c(x)}) ≤ c_α ( R(f) − R(f_c) )^α.   (3.2)

All distributions satisfy (3.2) with α = 0 and c_α = 1. The following sharper comparison bound for α > 0 follows immediately from [4, Lemma 6] and Proposition 3.

Corollary 1. Let φ be a classifying loss satisfying φ''(0) > 0. If ρ satisfies the Tsybakov noise condition (3.2) for some α ∈ [0, 1] and c_α > 0, then

    R(sgn(f)) − R(f_c) ≤ 2 c_φ c_α ( E^φ(f) − E^φ(f^φ_ρ) )^{1/(2−α)}, ∀ f : X → ℝ̄.

3.2. Projection Operator

By the comparison theorems, we only need to bound the excess generalization error E^φ(f_z) − E^φ(f^φ_ρ) in order to study the performance of the classifier sgn(f_z). But we can do better using the special feature of a classifying loss that it achieves a zero. A key technical tool here is a projection operator. To simplify the notations and statements, we will restrict our discussion to normalized classifying loss functions.

Firstly we show that the target function f^φ_ρ can be chosen to be bounded. Set a univariate convex function Q for x ∈ X as

    Q(t) = Q_x(t) := ∫_Y φ(yt) dρ(y|x), t ∈ ℝ.   (3.3)

Its one-side derivatives exist, are nondecreasing and satisfy Q'_−(t) ≤ Q'_+(t) for every t ∈ ℝ. Denote

    f^−_ρ(x) = sup{ t ∈ ℝ : Q'_−(t) < 0 }, f^+_ρ(x) = inf{ t ∈ ℝ : Q'_+(t) > 0 }.
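The following short sketch (not part of the paper) illustrates these definitions numerically: for a fixed x, Q_x(t) = η(x)φ(t) + (1−η(x))φ(−t), and the interval [f^−_ρ(x), f^+_ρ(x)] of minimizers is located on a grid. The loss functions and the value η = 0.7 are arbitrary choices.

```python
# Numerical illustration of Q_x and its minimizers f_rho^-(x), f_rho^+(x).
import numpy as np

def Q(phi, eta, t):
    return eta * phi(t) + (1.0 - eta) * phi(-t)

def minimizers_of_Q(phi, eta, t_grid=np.linspace(-3, 3, 6001), tol=1e-9):
    """Return the (approximate) endpoints of the set where Q_x attains its minimum."""
    values = Q(phi, eta, t_grid)
    t_min = t_grid[values <= values.min() + tol]
    return t_min.min(), t_min.max()

hinge = lambda t: np.maximum(1.0 - t, 0.0)      # phi_h
square = lambda t: (1.0 - t) ** 2               # phi_ls

eta = 0.7
print("hinge loss:  minimizers of Q_x ~", minimizers_of_Q(hinge, eta))
print("square loss: minimizers of Q_x ~", minimizers_of_Q(square, eta))
# With eta = 0.7 the hinge loss gives t = 1 = f_c(x) and the square loss gives t = 0.4 = f_rho(x),
# consistent with f_rho^{phi_h} = f_c (Proposition 2) and f_rho^{phi_ls} = f_rho (Section 5.2).
```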

Theorem 3. Let φ be a normalized classifying loss function. Then

(a) for each x ∈ X, the univariate function Q given by (3.3) is strictly decreasing on (−∞, f^−_ρ(x)], strictly increasing on [f^+_ρ(x), +∞), and is constant on [f^−_ρ(x), f^+_ρ(x)];

(b) f^φ_ρ : X → ℝ̄ is a minimizer of the generalization error E^φ(f) if and only if, for almost every x ∈ (X, ρ_X), f^φ_ρ(x) is a minimizer of Q, that is, there holds

    f^−_ρ(x) ≤ f^φ_ρ(x) ≤ f^+_ρ(x);   (3.4)

(c) we may choose a minimizer f^φ_ρ of E^φ satisfying f^φ_ρ(x) ∈ [−1, 1] for each x ∈ X.

Proof. Let x ∈ X. Consider the univariate continuous function Q given by (3.3). It is strictly decreasing on the interval (−∞, f^−_ρ(x)), since Q'_−(t) < 0 on this interval. In the same way, Q'_+(t) > 0 for t > f^+_ρ(x), so Q is strictly increasing on (f^+_ρ(x), +∞). For t ∈ (f^−_ρ(x), f^+_ρ(x)), we have 0 ≤ Q'_−(t) ≤ Q'_+(t) ≤ 0, hence Q is constant there, taking the minimal value of Q on ℝ. This proves (a).

Since E^φ(f) = ∫_X Q_x(f(x)) dρ_X(x), the statement (b) follows directly from (a).

By the assumption, φ is convex and has minimal zero 1. This implies that φ is strictly decreasing on (−∞, 1] and nondecreasing on [1, +∞). So Q(t) ≥ Q(1) for t > 1 and Q(t) ≥ Q(−1) for t < −1. So a minimum of Q can always be achieved on [−1, 1]. Hence we may choose f^φ_ρ such that f^φ_ρ(x) ∈ [−1, 1]. This proves the statement (c).

In what follows we shall always choose f^φ_ρ with |f^φ_ρ(x)| ≤ 1 for normalized classifying loss functions. Then we can make full use of the projection operator introduced in [9].

Definition 5. The projection operator π is defined on the space of measurable functions f : X → ℝ̄ as

    π(f)(x) = 1 if f(x) > 1;  π(f)(x) = −1 if f(x) < −1;  π(f)(x) = f(x) if −1 ≤ f(x) ≤ 1.   (3.5)

It is easy to see that π(f) and f induce the same classifier, i.e., sgn(π(f)) = sgn(f). Applying this fact to the comparison theorems, it is sufficient for us to bound the excess generalization error for π(f_z) instead of f_z. This leads to better estimates, as we will see later. The following property of the projection operator is immediate from the definition of φ.

Proposition 4. If φ is a normalized classifying loss function, then there holds almost surely

    φ(yπ(f)(x)) ≤ φ(yf(x)).   (3.6)

Hence for any measurable function f, we have E^φ(π(f)) ≤ E^φ(f) and E^φ_z(π(f)) ≤ E^φ_z(f).

3.3. Error Decomposition

Now we can present the error decomposition which leads to bounds on the excess generalization error for π(f_z). Define

    f_λ = arg min_{f∈H_Σ} { E^φ(f) + λ ||f||_Σ² }.

Proposition 5. Let φ be a normalized classifying loss and f_z be given by (1.6). Then

    E^φ(π(f_z)) − E^φ(f^φ_ρ) + λ ||f_z||_Σ² ≤ D(λ) + S_{z,λ},   (3.7)

where D(λ) is the regularization error of the multi-kernel space H_Σ defined [27] as

    D(λ) = inf_{σ∈Σ} inf_{f∈H_σ} { E^φ(f) − E^φ(f^φ_ρ) + λ ||f||_σ² }   (3.8)

and

    S_{z,λ} = { E^φ(π(f_z)) − E^φ_z(π(f_z)) } + { E^φ_z(f_λ) − E^φ(f_λ) }.   (3.9)

Proof. Write E^φ(π(f_z)) − E^φ(f^φ_ρ) + λ ||f_z||_Σ² as

    { E^φ(π(f_z)) − E^φ_z(π(f_z)) } + { (E^φ_z(π(f_z)) + λ ||f_z||_Σ²) − (E^φ_z(f_λ) + λ ||f_λ||_Σ²) } + { E^φ_z(f_λ) − E^φ(f_λ) } + { E^φ(f_λ) − E^φ(f^φ_ρ) + λ ||f_λ||_Σ² }.

By Proposition 4, E^φ_z(π(f_z)) ≤ E^φ_z(f_z). This in connection with the definition of f_z tells us that the second term is ≤ 0. Note that S_{z,λ} is just the sum of the first and third terms. By the definition of f_λ, the last term equals D(λ). This proves (3.7).

The regularization error term D(λ) in the error decomposition (3.7) is independent of the sample. It can be estimated by K-functionals, as discussed in Section 4.
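A small sketch, not from the paper, of the projection operator π of Definition 5, together with a numerical check of Proposition 4 for the hinge loss on random values:

```python
# pi(f): clip real-valued predictions to [-1, 1]; sgn(pi(f)) = sgn(f).
import numpy as np

def project(f_values):
    return np.clip(f_values, -1.0, 1.0)

hinge = lambda t: np.maximum(1.0 - t, 0.0)

rng = np.random.default_rng(3)
f_vals = 5.0 * rng.standard_normal(1000)          # values f(x_i) of some real-valued function
y = rng.choice([-1.0, 1.0], size=1000)            # labels
# Proposition 4: phi(y * pi(f)(x)) <= phi(y * f(x)) pointwise for a normalized classifying loss
assert np.all(hinge(y * project(f_vals)) <= hinge(y * f_vals) + 1e-12)
# hence the empirical errors satisfy E_z(pi(f)) <= E_z(f)
print(hinge(y * project(f_vals)).mean(), "<=", hinge(y * f_vals).mean())
```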

The last term S_{z,λ} in (3.7) is called the sample error. Without projection, it is well understood because of the vast literature in learning theory. We are able to improve the sample error estimates, stated in Theorem 5 below, because of the projection operator.

The comparison theorems and the error decomposition help switch the goal of the error analysis to the estimation of the regularization error and the sample error. For instance, to prove Theorem 2, we first apply Proposition 2 to π(f_z) and then Proposition 5. It tells us that R(sgn(f_z)) − R(f_c) is bounded by the sum of S_{z,λ} and D(λ).

4. Estimating Regularization Error and Approximation Error

In this section, we discuss the estimation of the regularization error. The convexity of φ implies that φ'(t) = φ'_+(t) = φ'_−(t) for almost every t ∈ ℝ.

Theorem 4. Let φ be a normalized classifying loss. Then

    E^φ(f) − E^φ(f^φ_ρ) ≤ ||φ'||_{L^∞[−||f||_∞, ||f||_∞]} ||f − f^φ_ρ||_{L¹_{ρ_X}}.

If moreover φ is C¹ and φ' is absolutely continuous on ℝ, we have

    E^φ(f) − E^φ(f^φ_ρ) ≤ ||φ''||_{L^∞[−||f||_∞−1, ||f||_∞+1]} ||f − f^φ_ρ||²_{L²_{ρ_X}}.

Proof. With the function Q = Q_x defined in (3.3), write E^φ(f) − E^φ(f^φ_ρ) as

    E^φ(f) − E^φ(f^φ_ρ) = ∫_X { Q(f(x)) − Q(f^φ_ρ(x)) } dρ_X.

Since φ'(0) < 0 and φ(t) ≥ 0, we have φ(0) > 0 and φ'_±(t) < 0 for t < 0. Let P(t) = max{|φ'_±(t)|, |φ'_±(−t)|} for t > 0. We only need to prove

    Q(f(x)) − Q(f^φ_ρ(x)) ≤ P(|f(x)|) |f(x) − f^φ_ρ(x)|   (4.1)

for those x with Q(f(x)) − Q(f^φ_ρ(x)) > 0. According to Theorem 3, such a point x satisfies f(x) ∉ [f^−_ρ(x), f^+_ρ(x)].

If f(x) > f^+_ρ(x), then Q is strictly increasing on [f^+_ρ(x), +∞). Hence f(x) > f^φ_ρ(x). By Theorem 3, we have

    Q(f(x)) − Q(f^φ_ρ(x)) ≤ Q'_−(f(x)) ( f(x) − f^φ_ρ(x) ).

Note that both φ'_− and φ'_+ are nondecreasing, and Q(t) = η(x)φ(t) + (1−η(x))φ(−t). Hence

    Q'_−(f(x)) = η(x)φ'_−(f(x)) − (1−η(x))φ'_+(−f(x)) ≤ max{ |φ'_±(|f(x)|)|, |φ'_±(−|f(x)|)| } = P(|f(x)|),

no matter whether f(x) ≥ 0 or not. Thus, (4.1) holds true when f(x) > f^+_ρ(x).

In the same way, if f(x) < f^−_ρ(x), then Q is strictly decreasing on (−∞, f^−_ρ(x)]. Hence f(x) < f^φ_ρ(x). Theorem 3 yields again

    Q(f(x)) − Q(f^φ_ρ(x)) ≤ −Q'_+(f(x)) ( f^φ_ρ(x) − f(x) ).

Since −Q'_+(f(x)) = −η(x)φ'_+(f(x)) + (1−η(x))φ'_−(−f(x)) ≤ P(|f(x)|), we see that (4.1) also holds when f(x) < f^−_ρ(x). This proves the first statement.

If φ is C¹ and φ' is absolutely continuous on ℝ, we know from Theorem 3 that Q'(f^φ_ρ(x)) = 0. Hence

    Q(f(x)) − Q(f^φ_ρ(x)) = ∫_{f^φ_ρ(x)}^{f(x)} { Q'(u) − Q'(f^φ_ρ(x)) } du ≤ ||Q''||_{L^∞(I)} |f(x) − f^φ_ρ(x)|² / 2,

where I is the interval between f^φ_ρ(x) and f(x). Then the second statement follows.

In the above, L^q_{ρ_X} is the L^q space with norm ||f||_{L^q_{ρ_X}} = ( ∫_X |f(x)|^q dρ_X )^{1/q}.

Thus, we can use the rich knowledge from approximation theory to estimate the regularization error. See [9] for details on bounding the regularization error for the SVM q-norm soft margin classifiers by means of K-functionals in L^q_{ρ_X}.

One advantage of multi-kernel algorithms is the improvement of regularization errors compared with the one-kernel setting. For examples and discussion, see [44, 30, 26].

5. Sample Error Estimates and Learning Rates

We are in a position to estimate the sample error and derive the learning rates. Throughout this section, we assume that the kernels are uniformly bounded in the sense that

    κ := sup_{σ∈Σ} √( ||K_σ||_{C(X×X)} ) < ∞.   (4.2)

To state our result, we need to introduce several further concepts and notations. The quantity E^φ(π(f_z)) − E^φ_z(π(f_z)) in the sample error (3.9) needs to be estimated by some uniform law of large numbers. To this end, we need the capacity of the hypothesis space, which plays an essential role in sample error estimates. In this paper, we use covering numbers measured by empirical distances.

Definition 6. Let F be a set of functions on Z and z = {z_1, ..., z_m} ∈ Z^m. The metric d_{2,z} is defined on F by

    d_{2,z}(f, g) = ( (1/m) Σ_{i=1}^m ( f(z_i) − g(z_i) )² )^{1/2}.

For every ε > 0, the covering number of F with respect to d_{2,z} is defined as

    N_{2,z}(F, ε) = inf{ l ∈ ℕ : ∃ {f_i}_{i=1}^l ⊂ F such that F = ∪_{i=1}^l { f ∈ F : d_{2,z}(f, f_i) ≤ ε } }.

The function sets in our situation are balls of the multi-kernel space of the form

    B_R = { f ∈ H_Σ : ||f||_Σ ≤ R } = ∪_{σ∈Σ} { f ∈ H_σ : ||f||_σ ≤ R }.

We need the empirical covering number of B_1 defined as

    N(ε) = sup_{m∈ℕ} sup_{x∈X^m} N_{2,x}(B_1, ε).   (4.3)

For a function f : Z → ℝ, denote 𝔼f = ∫_Z f(z) dρ.

Theorem 5. Let φ be a normalized classifying loss. Assume the following conditions with exponents q > 0, τ ∈ [0, 1] and p ∈ (0, 2):

(1) an increment condition for φ with a constant c_q > 0:

    φ(t) ≤ c_q |t|^q, ∀ |t| ≥ 1,   (4.4)

(2) a variance-expectation bound for the pair (φ, ρ) with the exponent τ and some c_τ > 0:

    𝔼{ ( φ(yf(x)) − φ(yf^φ_ρ(x)) )² } ≤ c_τ ( E^φ(f) − E^φ(f^φ_ρ) )^τ, ∀ ||f||_∞ ≤ 1,   (4.5)

(3) a capacity condition for the function set B_1 with a constant c_p > 0:

    log N(ε) ≤ c_p (1/ε)^p, ∀ ε > 0.   (4.6)

If D(λ) ≤ c_β λ^β for some 0 < β ≤ 1 and c_β > 0, then for any ε > 0 and 0 < δ < 1, there exists a constant c independent of m such that, with λ = λ(m) = (1/m)^γ, we have with confidence 1 − δ,

    E^φ(π(f_z)) − E^φ(f^φ_ρ) ≤ c (1/m)^θ,   (4.7)

where

    γ = min{ 2/(β(4−2τ+pτ) + p(1−β)), 2/(2β + q − βq) },   (4.8)

    θ = min{ 2β/(β(4−2τ+pτ) + p(1−β)) − ε, 2β/(2β + q − βq) }.   (4.9)

The proof of Theorem 5 will be given at the end of this section. Before applying Theorem 5, let us remark on the assumptions.

The increment condition (4.4) is satisfied by many useful loss functions, including the hinge loss and the least square loss. The variance-expectation condition (4.5) for the pair (φ, ρ) always holds for τ = 0 with c_τ = (max{φ(−1), φ(1)})². This can be seen from the fact that |φ(yf(x)) − φ(yf^φ_ρ(x))| ≤ max{φ(−1), φ(1)}. Larger exponents τ are possible when φ has high convexity (such as φ_ls in Theorem 7 below) or when the distribution ρ satisfies some conditions (such as the Tsybakov noise condition (1.13) in Theorem 6 below).

The capacity condition (4.6) always holds with p ≤ 2 if K_Σ contains only one kernel. Note that for any function set F ⊂ C(X), the empirical covering number N_{2,x}(F, ε) is bounded by N(F, ε), the (uniform) covering number of F under the metric ||·||_∞, since d_{2,x}(f, g) ≤ ||f − g||_∞. So in the multi-kernel setting, the behavior of the covering number N(ε) can be estimated by the uniform smoothness of the kernels in K_Σ according to [43].

Example 2. If the set K_Σ of kernels on X ⊂ ℝ^n satisfies (1.10) for some s > 0, then there is a constant c_s > 0 such that log N(ε) ≤ c_s (1/ε)^{2n/s} for any ε > 0.

The regularization error D(λ) decays to zero once H_Σ is dense in C(X). By the discussion in Section 4, the decay rate with an exponent β can be estimated if some a priori knowledge of the distribution is available; see [9] for explicit examples.

Let us now show how to apply Theorem 5 to derive learning rates. Recall Proposition 3 and Corollary 1. A direct corollary of Theorem 5 is as follows.

Corollary 2. Under the assumptions of Theorem 5, if φ''(0) > 0, then for any ε > 0 and 0 < δ < 1, there is a constant c independent of m such that with confidence 1 − δ,

    R(sgn(f_z)) − R(f_c) ≤ c (1/m)^{θ/2},   (4.10)

where λ = (1/m)^γ and γ, θ are given by (4.8) and (4.9), respectively. If, in addition, ρ satisfies the noise condition (3.2) with 0 < α ≤ 1, the power θ/2 in (4.10) can be improved to θ/(2−α).

Next we consider two classical classification algorithms: the SVM classification and the least square method.

5.1. Learning Rates for the SVM Classification

For the SVM classification with the hinge loss, we illustrate how noise conditions on the distribution ρ raise the variance-expectation exponent τ in (4.5) from 0 (for general distributions) to τ = ζ/(ζ+1) > 0.

Theorem 6. Let φ = φ_h and let the multi-kernels {K_σ : σ ∈ Σ} satisfy (4.6). Assume

    inf_{σ∈Σ} inf_{f∈H_σ} { E^{φ_h}(f) − E^{φ_h}(f_c) + λ ||f||_σ² } ≤ c_β λ^β, ∀ λ > 0   (4.11)

with 0 < β ≤ 1, c_β > 0, and that ρ satisfies the noise condition (1.13) with ζ ∈ [0, ∞] and c_ζ > 0. Choose λ = λ(m) = (1/m)^{min{2(ζ+1)/(2β(ζ+2)+p(ζ+1−β)), 2/(β+1)}}. For any ε > 0 and 0 < δ < 1, there exists a constant C_ε > 0 independent of m such that with confidence 1 − δ,

    R(sgn(f_z)) − R(f_c) ≤ C_ε (1/m)^θ, θ = min{ 2β(ζ+1)/(2β(ζ+2)+p(ζ+1−β)) − ε, 2β/(1+β) }.

Proof. Observe that φ_h satisfies the increment condition (4.4) with q = 1 and c_q = 2. Because of the noise condition (1.13), we know from [30] and [38] that the condition (4.5) is valid with the exponent τ = ζ/(ζ+1) and a constant c_τ depending only on ζ and the constant in (1.13). Then the conclusion follows from Theorem 5 and Proposition 2.

Theorem 2 stated in the introduction is a special case of Theorem 6 with multi-kernels having a uniform bound in C^s.

Proof of Theorem 2. By Example 2, (4.6) holds with p = 2n/s. Since φ_h is Lipschitz, Theorem 4 yields E^{φ_h}(f) − E^{φ_h}(f_c) ≤ ||f − f_c||_{L¹_{ρ_X}}. Hence (1.11) implies (4.11). Take ζ = 0 since no assumption on the noise is made. We see that Theorem 2 follows from Theorem 6.
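The small helper below is not from the paper; it simply evaluates the exponents of Theorem 5 ((4.8)-(4.9)) and, for the hinge loss with q = 1, τ = ζ/(ζ+1) and p = 2n/s, compares the resulting exponent of Theorem 6 with the Steinwart-Scovel exponent quoted in Section 1, for a few sample parameter values (β < 1, where the improvement is visible).

```python
# Concrete values of the learning-rate exponents; parameter choices are illustrative.
def theorem5_exponents(q, tau, p, beta, eps=0.0):
    gamma = min(2.0 / (beta * (4 - 2 * tau + p * tau) + p * (1 - beta)),
                2.0 / (2 * beta + q - beta * q))
    theta = min(2 * beta / (beta * (4 - 2 * tau + p * tau) + p * (1 - beta)) - eps,
                2 * beta / (2 * beta + q - beta * q))
    return gamma, theta

def theta_steinwart_scovel(beta, zeta, n, s, eps=0.0):
    return 2 * beta * (zeta + 1) / ((2 + zeta + zeta * n / s) * (1 + beta)) - eps

for beta, zeta, n, s in [(0.5, 0.0, 3, 6), (0.5, 1.0, 3, 6), (0.8, 2.0, 2, 8)]:
    p, tau = 2.0 * n / s, zeta / (zeta + 1.0)
    _, theta_ours = theorem5_exponents(q=1, tau=tau, p=p, beta=beta)
    print(f"beta={beta}, zeta={zeta}, n/s={n}/{s}:  ours={theta_ours:.3f}  "
          f"Steinwart-Scovel={theta_steinwart_scovel(beta, zeta, n, s):.3f}")
```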

5.2. Learning Rates with the Least-square Loss

Consider the least-square loss φ_ls(t) = (1−t)² investigated in [31]. We illustrate how the high convexity of the loss function yields a large variance-expectation exponent τ in (4.5). Here φ_ls(yf(x)) = (1−yf(x))² = (y−f(x))² since y² = 1 for y ∈ Y. So we know [35] that f^φ_ρ = f_ρ, and the high convexity of φ_ls ensures [12] that (4.5) holds true with τ = 1 and c_τ = 1. The increment condition (4.4) for φ_ls is true with q = 2. Moreover, E^{φ_ls}(f) − E^{φ_ls}(f_ρ) = ||f − f_ρ||²_{L²_{ρ_X}}. Putting all these into Proposition 3 and Corollary 2, we obtain the following learning rate.

Theorem 7. Consider (1.6) with φ = φ_ls and multi-kernels {K_σ : σ ∈ Σ} satisfying (4.6) with some p ∈ (0, 2). Assume that for some 0 < β ≤ 1 and c_β > 0,

    inf_{σ∈Σ} inf_{f∈H_σ} { ||f − f_ρ||²_{L²_{ρ_X}} + λ ||f||_σ² } ≤ c_β λ^β, ∀ λ > 0.   (4.12)

Then by choosing λ = λ(m) = (1/m)^{min{2/(2β+p), 1}}, for any ε > 0 and 0 < δ < 1, there exists a constant C_ε independent of m such that with confidence 1 − δ,

    R(sgn(f_z)) − R(f_c) ≤ C_ε (1/m)^θ with θ = (1/2) min{ 2β/(2β+p) − ε, β }.   (4.13)

If moreover ρ satisfies (3.2), then θ can be improved to (1/(2−α)) min{ 2β/(2β+p) − ε, β }. In particular, when inf_{x∈X} |f_ρ(x)| > 0, (4.13) holds with θ = min{ 2β/(2β+p) − ε, β }.

The above learning rate is better than those in the literature, e.g. [13, 23, 6, 40]. When the kernels are C^∞ with (1.10) valid for any s > 0, we may take p in Theorem 7 to be arbitrarily small and the power θ in (4.13) becomes min{1/2 − ε, β/2}.

Example 3. Let φ(t) = (1−t)², Σ = [σ_1, σ_2] with 0 < σ_1 ≤ σ_2 < ∞ and let K_σ be the Gaussian kernel K_σ(x, y) = exp(−|x−y|²/(2σ²)) on X ⊂ ℝ^n. Assume (4.12). Let ε > 0 and λ = λ(m) = (1/m)^{min{1/β − ε, 1}}. Then with confidence 1 − δ, we have

    R(sgn(f_z)) − R(f_c) ≤ c (1/m)^{θ/2}, θ = min{1 − ε, β}.

If ρ satisfies the noise condition (3.2) with 0 < α ≤ 1, then θ/2 can be improved to θ/(2−α) = (1/(2−α)) min{1 − ε, β}. When inf_{x∈X} |f_ρ(x)| > 0, we can replace θ/2 by min{1 − ε, β}.
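The following end-to-end sketch is not from the paper; it works in the spirit of Example 3 with a synthetic distribution whose conditional probability η(x) = P(Y = 1 | x) is known, so that the Bayes rule f_c = sgn(f_ρ) and the excess misclassification error R(sgn(f_z)) − R(f_c) can be estimated directly. The kernel width and λ are arbitrary choices.

```python
# Empirical excess misclassification error of sgn(pi(f_z)) on a synthetic distribution.
import numpy as np

rng = np.random.default_rng(5)

def eta(X):                       # a hypothetical conditional probability P(Y = 1 | x)
    return 1.0 / (1.0 + np.exp(-4.0 * X[:, 0]))

def sample(m):
    X = rng.uniform(-1, 1, size=(m, 2))
    y = np.where(rng.uniform(size=m) < eta(X), 1.0, -1.0)
    return X, y

def gaussian(A, B, sigma):
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(2) / (2 * sigma ** 2))

# Train with the least-square loss on one Gaussian kernel (sigma, lam chosen arbitrarily)
Xtr, ytr = sample(200)
m, sigma, lam = len(ytr), 0.5, 1.0 / 200
c = np.linalg.solve(gaussian(Xtr, Xtr, sigma) + lam * m * np.eye(m), ytr)

# Evaluate the classifier sgn(pi(f_z)) against the Bayes rule on a large test sample
Xte, yte = sample(20000)
f_vals = gaussian(Xte, Xtr, sigma) @ c
pred = np.sign(np.clip(f_vals, -1, 1))           # the projection does not change the sign
bayes = np.sign(2 * eta(Xte) - 1)                # f_c = sgn(f_rho), f_rho = 2 eta - 1
R_fz = np.mean(pred != yte)
R_fc = np.mean(bayes != yte)
print("estimated excess misclassification error:", R_fz - R_fc)
```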

5.3. Proof of the Main Result

To end this section, we prove our main result, Theorem 5. To this end, we shall use the following concentration inequality.

Proposition 6. Let F be a set of measurable functions on Z, and let B, c > 0, τ ∈ [0, 1] be constants such that each function f ∈ F satisfies ||f||_∞ ≤ B and 𝔼(f²) ≤ c(𝔼f)^τ. If for some a > 0 and p ∈ (0, 2),

    sup_{m∈ℕ} sup_{z∈Z^m} log N_{2,z}(F, ε) ≤ a ε^{−p}, ∀ ε > 0,   (4.14)

then there exists a constant c'_p depending only on p such that for any t > 0, with probability at least 1 − e^{−t}, there holds

    𝔼f − (1/m) Σ_{i=1}^m f(z_i) ≤ (1/2) η^{1−τ} (𝔼f)^τ + c'_p η + 2 (ct/m)^{1/(2−τ)} + 18Bt/m, ∀ f ∈ F,

where

    η := max{ c^{(2−p)/(4−2τ+pτ)} (a/m)^{2/(4−2τ+pτ)}, B^{(2−p)/(2+p)} (a/m)^{2/(2+p)} }.

To prove Proposition 6, we need some preparations.

Definition 7. A function ψ : ℝ_+ → ℝ_+ is sub-root if it is non-negative, non-decreasing, and ψ(r)/√r is non-increasing. For a sub-root function ψ and any D > 0, the equation ψ(r) = r/D has a unique positive solution.

The following proposition is given in [3, Theorem 3].

Proposition 7. Let F be a class of measurable, square integrable functions such that 𝔼f − f ≤ b for all f ∈ F. Let ψ be a sub-root function, D be some positive constant and r* be the unique solution to ψ(r) = r/D. Assume that

    𝔼[ max{ 0, sup_{f∈F: 𝔼f² ≤ r} { 𝔼f − (1/m) Σ_{i=1}^m f(z_i) } } ] ≤ ψ(r), ∀ r ≥ r*.

Then for all t > 0 and all K > D/7, with probability at least 1 − e^{−t} there holds

    𝔼f − (1/m) Σ_{i=1}^m f(z_i) ≤ 𝔼f²/K + 50K r*/D² + (K + 9b) t/m, ∀ f ∈ F.

We need to find the sub-root function ψ in our setting. To this end, introduce the Rademacher variables ε_i, i = 1, ..., m. Then

    𝔼[ sup_{f∈F: 𝔼f²≤r} { 𝔼f − (1/m) Σ_{i=1}^m f(z_i) } ] ≤ 2 𝔼[ sup_{f∈F: 𝔼f²≤r} (1/m) Σ_{i=1}^m ε_i f(z_i) ].   (4.15)

The right hand side is called the local Rademacher process. It can be bounded by using empirical covering numbers and the entropy integral. See [34].

The following result is a scaled version of Proposition 5.4 in [30], where the case B = 1 is given.

Proposition 8. Let F be a class of measurable functions from Z to [−B, B]. Assume (4.14) for some p ∈ (0, 2) and a > 0. Then there exists a constant c_p depending only on p such that

    𝔼[ sup_{f∈F: 𝔼f²≤r} (1/m) Σ_{i=1}^m ε_i f(z_i) ] ≤ c_p max{ r^{1/2−p/4} (a/m)^{1/2}, B^{(2−p)/(2+p)} (a/m)^{2/(2+p)} }.

According to Proposition 8 and (4.15), in applying Proposition 7 one should take

    ψ(r) = 2c_p max{ r^{1/2−p/4} (a/m)^{1/2}, B^{(2−p)/(2+p)} (a/m)^{2/(2+p)} }.   (4.16)

Then the solution r* to the equation ψ(r) = r/D satisfies

    r* ≤ max{ (2c_p D)^{4/(2+p)} (a/m)^{2/(2+p)}, 2c_p D B^{(2−p)/(2+p)} (a/m)^{2/(2+p)} }.   (4.17)

Proof of Proposition 6. Let ψ be defined by (4.16) and r* be the solution to ψ(r) = r/D. Since ||f||_∞ ≤ B, we have 𝔼f − f ≤ b := 2B for each f ∈ F. Choose K = D/5. By Proposition 7 and the condition 𝔼f² ≤ c(𝔼f)^τ, we know that with probability at least 1 − e^{−t} there holds

    𝔼f − (1/m) Σ_{i=1}^m f(z_i) ≤ (5c/D)(𝔼f)^τ + (10/D) r* + (D/5 + 18B) t/m, ∀ f ∈ F.   (4.18)

Recall that r* satisfies (4.17). Take D = 10cη^{τ−1}, where η is given in our statement. Then 5c/D = (1/2)η^{1−τ}. The expression of η in connection with the bound (4.17) for r* tells

us that (10/D) r* ≤ c̃_p η, where c̃_p is a constant depending only on p and c_p, hence only on p. Observe from the choice of D that

    Dt/(5m) = 2ct η^{τ−1}/m ≤ 2 max{ η, (ct/m)^{1/(2−τ)} },

according to whether η ≥ (ct/m)^{1/(2−τ)} or not. Take c'_p to be the constant c̃_p + 2, depending only on p. Then the desired inequality holds for each f ∈ F. This proves Proposition 6.

We now turn to our key analysis and prove Theorem 5. Let us first explain our main ideas. In the sample error term of (3.7), the quantity E^φ_z(f_λ) − E^φ(f_λ) is easy to handle. It can be estimated by the one-side Bernstein inequality for the single random variable φ(yf_λ(x)) on Z. This will be done in the first step of the proof with a mild technical modification: consider the random variable ξ = φ(yf_λ(x)) − φ(yf^φ_ρ(x)) instead of φ(yf_λ(x)).

The quantity E^φ(π(f_z)) − E^φ_z(π(f_z)) is more difficult and we need Proposition 6 to estimate it. Here the function set will be

    F = { φ(yπ(f)(x)) − φ(yf^φ_ρ(x)) : f ∈ B_R }

with a radius R such that B_R contains f_z, i.e., R is a bound on ||f_z||_Σ. On the other hand, a smaller radius R yields better estimates. Hence good bounds for ||f_z||_Σ play an important role in the sample error estimates.

A rough bound for ||f_z||_Σ immediately follows from the definition of f_z. By choosing f = 0, we find

    λ ||f_z||²_Σ ≤ E^φ_z(f_z) + λ ||f_z||²_Σ ≤ E^φ_z(0) + λ · 0 = φ(0).

This proves

Lemma 1. For every λ > 0, there holds ||f_z||_Σ ≤ √(φ(0)/λ).

We may use the bound √(φ(0)/λ) as R in F and apply Proposition 6 to get some rough estimates for E^φ(π(f_z)) − E^φ_z(π(f_z)). However, the empirical error E^φ_z(f) is a good approximation of the generalization error E^φ(f). Hence the penalty value ||f_z||_Σ is expected to be close to ||f_λ||_Σ, which is bounded by √(D(λ)/λ):

    λ ||f_λ||²_Σ ≤ E^φ(f_λ) − E^φ(f^φ_ρ) + λ ||f_λ||²_Σ = D(λ).   (4.19)

This expectation will be realized by an iteration technique used in [30] and [38]. By this technique, we shall show under some assumptions that, with high confidence, ||f_z||_Σ has a bound arbitrarily close to √(D(λ)/λ) (in the order of λ).
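As a quick numerical aside (not from the paper), Lemma 1 can be checked directly for the least-square scheme on a single Gaussian kernel: ||f_z||_K = √(cᵀKc) is computed and compared with the rough bound √(φ(0)/λ) = √(1/λ). The data and parameters are arbitrary.

```python
# Lemma 1 in practice: ||f_z||_K versus the rough bound sqrt(phi(0)/lambda).
import numpy as np

rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(100, 2))
y = np.sign(X[:, 0] + 0.2 * rng.standard_normal(100))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(2) / (2 * 0.5 ** 2))

for lam in (1.0, 0.1, 0.01):
    c = np.linalg.solve(K + lam * len(y) * np.eye(len(y)), y)
    norm_fz = np.sqrt(c @ K @ c)                  # ||f_z||_K for f_z = sum_j c_j K(x_j, .)
    print(f"lambda={lam:5.2f}  ||f_z||_K={norm_fz:6.3f}  rough bound={np.sqrt(1.0/lam):6.3f}")
```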

We are now in a position to estimate the sample error and prove Theorem 5.

Proof of Theorem 5. Write the sample error as

    S_{z,λ} = { (E^φ(π(f_z)) − E^φ(f^φ_ρ)) − (E^φ_z(π(f_z)) − E^φ_z(f^φ_ρ)) } + { (E^φ_z(f_λ) − E^φ_z(f^φ_ρ)) − (E^φ(f_λ) − E^φ(f^φ_ρ)) } := S_1 + S_2.

We divide our estimation into three steps. Take t ≥ 1, which will be determined later. Denote B = max{φ(−1), φ(1)}.

Step 1: estimate S_2. Consider the random variable ξ = φ(yf_λ(x)) − φ(yf^φ_ρ(x)) on Z. Denote

    ξ = ξ_1 + ξ_2 = { φ(yf_λ(x)) − φ(yπ(f_λ)(x)) } + { φ(yπ(f_λ)(x)) − φ(yf^φ_ρ(x)) }.

First we bound ξ_1. By (1.2), (4.2) and (4.19), we have ||f_λ||_∞ ≤ κ ||f_λ||_Σ ≤ κ √(D(λ)/λ). We may assume the last quantity to be greater than one, since otherwise ξ_1 ≡ 0. Then the increment condition on φ tells us that 0 ≤ ξ_1 ≤ B_λ := c_q κ^q (D(λ)/λ)^{q/2}. Hence |ξ_1 − 𝔼(ξ_1)| ≤ B_λ. Applying the one-side Bernstein inequality to ξ_1, we know that for any ε > 0,

    Prob{ (1/m) Σ_{i=1}^m ξ_1(z_i) − 𝔼ξ_1 > ε } ≤ exp{ − mε² / ( 2(σ²(ξ_1) + B_λ ε/3) ) }.

Solving the quadratic equation mε²/(2(σ²(ξ_1) + B_λ ε/3)) = t for ε, we see that there exists a subset U_1 of Z^m with measure at least 1 − e^{−t} such that for every z ∈ U_1,

    (1/m) Σ_{i=1}^m ξ_1(z_i) − 𝔼ξ_1 ≤ B_λ t/(3m) + √( (B_λ t/(3m))² + 2σ²(ξ_1) t/m ) ≤ 2B_λ t/(3m) + √( 2t σ²(ξ_1)/m ).

But the fact 0 ≤ ξ_1 ≤ B_λ implies σ²(ξ_1) ≤ B_λ 𝔼(ξ_1). Therefore, we have

    (1/m) Σ_{i=1}^m ξ_1(z_i) − 𝔼ξ_1 ≤ 7B_λ t/(6m) + 𝔼ξ_1, ∀ z ∈ U_1.

Next we consider ξ_2. Since both yπ(f_λ)(x) and yf^φ_ρ(x) lie in [−1, 1], ξ_2 is a random variable satisfying |ξ_2| ≤ B. Applying the one-side Bernstein inequality as above, we know that there exists another subset U_2 of Z^m with measure at least 1 − e^{−t} such that for every z ∈ U_2,

    (1/m) Σ_{i=1}^m ξ_2(z_i) − 𝔼ξ_2 ≤ 2Bt/(3m) + √( 2t σ²(ξ_2)/m ).

By (4.5), we have σ²(ξ_2) ≤ c_τ (𝔼ξ_2)^τ. Applying the elementary inequality

    1/q* + 1/q'* = 1 with q*, q'* > 1 ⟹ ab ≤ a^{q*}/q* + b^{q'*}/q'*, a, b ≥ 0,

with q* = 2/(2−τ), q'* = 2/τ, a = √(2tc_τ/m) and b = (𝔼ξ_2)^{τ/2}, we see that

    √( 2t σ²(ξ_2)/m ) ≤ √( 2t c_τ (𝔼ξ_2)^τ / m ) ≤ (1 − τ/2)(2tc_τ/m)^{1/(2−τ)} + (τ/2) 𝔼ξ_2.

Hence

    (1/m) Σ_{i=1}^m ξ_2(z_i) − 𝔼ξ_2 ≤ 2Bt/(3m) + (2tc_τ/m)^{1/(2−τ)} + 𝔼ξ_2, ∀ z ∈ U_2.

Combining the above estimates for ξ_1 and ξ_2 with the fact 𝔼ξ_1 + 𝔼ξ_2 = 𝔼ξ ≤ D(λ) ≤ c_β λ^β, we conclude that

    S_2 ≤ (7B_λ + 4B)t/(6m) + (2tc_τ/m)^{1/(2−τ)} + D(λ), ∀ z ∈ U_1 ∩ U_2.   (4.20)

Step 2: estimate S_1. By Proposition 5, one has

    Δ_z := E^φ(π(f_z)) − E^φ(f^φ_ρ) + λ ||f_z||²_Σ ≤ S_1 + S_2 + D(λ).   (4.21)

Let R > 0. Apply Proposition 6 to the function set F = { φ(yπ(f)(x)) − φ(yf^φ_ρ(x)) : f ∈ B_R }. Since

    | φ(yπ(f)(x)) − φ(yπ(g)(x)) | ≤ |φ'_−(−1)| |π(f)(x) − π(g)(x)| ≤ |φ'_−(−1)| |f(x) − g(x)|,

there holds N_{2,z}(F, ε) ≤ N_{2,z}(B_R, ε/|φ'_−(−1)|). Hence (4.6) yields (4.14) with a = c_p |φ'_−(−1)|^p R^p.

Since |φ(yπ(f)(x))| ≤ B and |φ(yf^φ_ρ(x))| ≤ B, we know that ||f̃||_∞ ≤ B for every f̃ ∈ F. The assumption (4.5) tells us that 𝔼f̃² ≤ c(𝔼f̃)^τ with c = c_τ. Thus all the conditions of Proposition 6 hold, and we know that there is a subset V(R) of Z^m with measure at least 1 − e^{−t} such that for every z ∈ V(R) and every f ∈ B_R,

    ( E^φ(π(f)) − E^φ(f^φ_ρ) ) − ( E^φ_z(π(f)) − E^φ_z(f^φ_ρ) ) ≤ (1/2) η_R^{1−τ} ( E^φ(π(f)) − E^φ(f^φ_ρ) )^τ + c'_p η_R + 2(c_τ t/m)^{1/(2−τ)} + 18Bt/m,   (4.22)

where η_R is the quantity η given in Proposition 6 with c = c_τ and a = c_p |φ'_−(−1)|^p R^p, i.e.,

    η_R = max{ c_τ^{(2−p)/(4−2τ+pτ)} ( c_p |φ'_−(−1)|^p R^p / m )^{2/(4−2τ+pτ)}, B^{(2−p)/(2+p)} ( c_p |φ'_−(−1)|^p R^p / m )^{2/(2+p)} }.

Let W(R) be the subset of Z^m defined by

    W(R) = { z ∈ U_1 ∩ U_2 : ||f_z||_Σ ≤ R }.

Let z ∈ W(R) ∩ V(R). Then (4.22) holds for f_z. Together with the estimate (4.20) for S_2 and (4.21), we know that

    Δ_z ≤ (1/2) η_R^{1−τ} ( E^φ(π(f_z)) − E^φ(f^φ_ρ) )^τ + c'_p η_R + 4(c_τ t/m)^{1/(2−τ)} + (19B + 3B_λ/2) t/m + 2D(λ).

When τ = 1 this yields

    Δ_z ≤ c''_p η_R + 8(c_τ t/m)^{1/(2−τ)} + (38B + 3B_λ) t/m + 4D(λ),   (4.23)

where c''_p = max{2c'_p, 1} + 1; here we have bounded 2c'_p + 1 by c''_p. When 0 < τ < 1, we use the elementary inequality: if a, b > 0 and 0 < τ < 1, then

    x ≤ a x^τ + b, x > 0 ⟹ x ≤ max{ (2a)^{1/(1−τ)}, 2b }.

We find that (4.23) still holds. By the choice of λ = λ(m) = (1/m)^γ, one easily checks that

    η_R ≤ c_{p,τ} λ^β max{ (R²λ^{1−β})^{p/(4−2τ+pτ)}, (R²λ^{1−β})^{p/(2+p)} }

for some c_{p,τ} > 0. But 4 − 2τ + pτ ≥ 2 + p, hence if R ≥ λ^{−(1−β)/2}, then

    η_R ≤ c_{p,τ} λ^β (R²λ^{1−β})^{p/(2+p)} = c_{p,τ} λ^{(p+2β)/(2+p)} R^{2p/(2+p)}.   (4.24)

The choice of λ, together with the assumption D(λ) ≤ c_β λ^β on the regularization error and t ≥ 1, also implies

    8(c_τ t/m)^{1/(2−τ)} + (38B + 3B_λ) t/m + 4D(λ) ≤ c_{q,τ,β} t λ^β   (4.25)

for some c_{q,τ,β} > 0. Putting the estimates (4.25) and (4.24) into (4.23), we obtain

    Δ_z ≤ c''_p c_{p,τ} λ^{(p+2β)/(2+p)} R^{2p/(2+p)} + c_{q,τ,β} t λ^β, ∀ z ∈ W(R) ∩ V(R)   (4.26)

whenever R ≥ λ^{−(1−β)/2}. This implies that ||f_z||_Σ ≤ √(Δ_z/λ) ≤ g(R), where g : ℝ_+ → ℝ_+ is a univariate function defined as

    g(R) = √(c''_p c_{p,τ}) λ^{(β−1)/(2+p)} R^{p/(2+p)} + √(c_{q,τ,β} t) λ^{(β−1)/2}.   (4.27)

It follows that

    W(R) ∩ V(R) ⊆ W(g(R)), ∀ R ≥ λ^{−(1−β)/2}.   (4.28)

Step 3: by iteration, find a small ball B_R that, with high confidence, contains f_z. Lemma 1 means that W(R_0) = U_1 ∩ U_2 for R_0 = √(φ(0)/λ). When R_0 ≥ λ^{−(1−β)/2}, we use our conclusion (4.28) iteratively. Denote g^{[0]}(R) = R, g^{[1]}(R) = g(R) and g^{[l]}(R) = g(g^{[l−1]}(R)) for l ≥ 2. According to (4.28), if

    g^{[j]}(R) ≥ λ^{−(1−β)/2}, j = 0, 1, ..., l−1,   (4.29)

then

    W(R) ∩ V(R) ∩ V(g^{[1]}(R)) ∩ ... ∩ V(g^{[l−1]}(R)) ⊆ W(g^{[l]}(R)).   (4.30)

Observe that g(R) = d_0 R^{p/(2+p)} + d_1 with d_0, d_1 > 0 given in (4.27). Then

    g^{[2]}(R) = d_0 ( d_0 R^{p/(2+p)} + d_1 )^{p/(2+p)} + d_1 ≤ d_0^{1+p/(2+p)} R^{(p/(2+p))²} + d_1 + d_0 d_1^{p/(2+p)},

and in general, for l ∈ ℕ,

    g^{[l]}(R) ≤ d_0^{1+p/(2+p)+...+(p/(2+p))^{l−1}} R^{(p/(2+p))^l} + d_1 + d_0 d_1^{p/(2+p)} + d_0^{1+p/(2+p)} d_1^{(p/(2+p))²} + ... + d_0^{1+p/(2+p)+...+(p/(2+p))^{l−2}} d_1^{(p/(2+p))^{l−1}}.

This in connection with the expressions for d_0 and d_1 gives

    g^{[l]}(R) ≤ d_0^{((2+p)/2)(1−(p/(2+p))^l)} R^{(p/(2+p))^l} + Σ_{i=0}^{l−1} d_0^{Σ_{j=0}^{i−1}(p/(2+p))^j} d_1^{(p/(2+p))^i} ≤ c_0^{(2+p)/4} λ^{((β−1)/2)(1−(p/(2+p))^l)} R^{(p/(2+p))^l} + c_0^{(2+p)/4} λ^{(β−1)/2} Σ_{i=0}^{l−1} (c_1 t)^{(p/(2+p))^i},

where c_0 = max{1, c''_p c_{p,τ}} and c_1 = max{1, c_{q,τ,β}}. In particular, for R = R_0 there holds

    g^{[l]}(R_0) ≤ c_0^{(2+p)/4} λ^{(β−1)/2} { (φ(0))^{(1/2)(p/(2+p))^l} λ^{−(β/2)(p/(2+p))^l} + c_1 t l }.

For ε > 0, choose l_0 ∈ ℕ such that l_0 ≥ log(1/(2ε)) / log((2+p)/p). Then (1/2)(p/(2+p))^{l_0} ≤ ε. It follows that

    g^{[l_0]}(R_0) ≤ c_0^{(2+p)/4} λ^{(β−1)/2} { (φ(0))^{(1/2)(p/(2+p))^{l_0}} λ^{−βε} + c_1 t l_0 }

when (4.29) with l = l_0 and R = R_0 holds.

When (4.29) with l = l_0 and R = R_0 is not valid, we have g^{[j_0]}(R_0) ≤ λ^{−(1−β)/2} for some j_0 ∈ {0, 1, ..., l_0 − 1}. Take l_ε = l_0 when (4.29) with l = l_0 and R = R_0 holds, and l_ε = j_0 otherwise. In both cases, we have

    g^{[l_ε]}(R_0) ≤ c_ε λ^{−(1−β)/2 − βε} =: R_ε,   (4.31)

where c_ε := c_0^{(2+p)/4} ( (φ(0))^{(1/2)(p/(2+p))^{l_0}} + c_1 t l_0 ).

Take l = l_ε ≤ l_0 and R = R_0 in (4.30). Since W(R_0) = U_1 ∩ U_2, we know that there is a subset V_ε of Z^m with measure at most l_0 e^{−t} such that U_1 ∩ U_2 ⊆ W(R_ε) ∪ V_ε. Then the measure of the set W(R_ε) is at least 1 − (l_0 + 2)e^{−t}.

Apply (4.23) with R = R_ε and notice (4.25). Let z ∈ W(R_ε) ∩ V(R_ε). We know that

    Δ_z ≤ c''_p η_{R_ε} + c_{q,τ,β} t λ^β.

It is easy to check that η_{R_ε} ≤ c_{p,τ} c_ε^{2p/(2+p)} (1/m)^θ. Therefore, with the constant c = c''_p c_{p,τ} c_ε^{2p/(2+p)} + c_{q,τ,β} t, there holds

    E^φ(π(f_z)) − E^φ(f^φ_ρ) ≤ Δ_z ≤ c (1/m)^θ.

Taking t = log((l_0+3)/δ), the measure of the set W(R_ε) ∩ V(R_ε) is at least 1 − δ. Then Theorem 5 is proved.

6. Extensions

A key point of our analysis is to find essential bounds for the penalty functional values of regularization schemes. This approach can be extended to regularization schemes with more general loss functions and general penalty functionals.

Let the hypothesis space H be a function set containing 0. It is assigned a functional Ω : H → ℝ_+ satisfying Ω(0) = 0. Beyond the multi-kernel space H_Σ, an example of such a hypothesis space is given by the linear programming support vector machine classifier [38] in a one-kernel setting, with the penalty functional Ω(f) defined for f ∈ H = H_{K,z} = { Σ_{i=1}^m α_i y_i K_{x_i} : α_i ≥ 0 } as Ω(f) = Σ_{i=1}^m α_i.

Let Y be a subset of ℝ, and let V : ℝ² → ℝ_+ be a general loss function. The general regularization scheme in H associated with V and the penalty functional Ω is defined for the sample z as

    f^V_z = arg min_{f∈H} { (1/m) Σ_{i=1}^m V(y_i, f(x_i)) + λ Ω(f) }.   (5.1)

All the results we obtained for the multi-kernel regularized classifiers (1.6) can be established for the more general scheme (5.1) under the assumption that the pair (V, ρ) is M-admissible: there is a constant M > 0 such that |y| ≤ M almost surely with respect to ρ, and for each y ∈ [−M, M], V(y, t) is a convex function of the variable t ∈ ℝ satisfying

    V(y, t) ≥ V(y, M) for t > M, and V(y, t) ≥ V(y, −M) for t < −M.   (5.2)
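The sketch below is not the paper's formulation beyond the stated hypothesis space and penalty Ω(f) = Σ_i α_i; it shows one way the scheme (5.1) over H_{K,z} becomes a linear program when the hinge loss is used, by introducing slack variables ξ_i ≥ (1 − y_i f(x_i))_+. The kernel, data and λ are hypothetical choices.

```python
# Linear-programming SVM sketch for (5.1) with Omega(f) = sum_i alpha_i and the hinge loss.
import numpy as np
from scipy.optimize import linprog

def lp_svm(X, y, lam, sigma=0.5):
    m = len(y)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(2)
    K = np.exp(-d2 / (2 * sigma ** 2))
    # decision variables: [alpha_1..alpha_m, xi_1..xi_m], all >= 0
    c = np.concatenate([lam * np.ones(m), np.ones(m) / m])   # lam*sum(alpha) + (1/m)*sum(xi)
    # constraints: y_i * sum_j alpha_j y_j K(x_j, x_i) + xi_i >= 1
    A_ub = np.hstack([-(y[:, None] * K * y[None, :]), -np.eye(m)])
    b_ub = -np.ones(m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * m), method="highs")
    alpha = res.x[:m]
    return lambda Xnew: np.exp(-((Xnew[:, None, :] - X[None, :, :]) ** 2).sum(2)
                               / (2 * sigma ** 2)) @ (alpha * y)

rng = np.random.default_rng(6)
X = rng.uniform(-1, 1, size=(60, 2))
y = np.sign(X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(60))
f = lp_svm(X, y, lam=0.05)
print("training sign accuracy:", np.mean(np.sign(f(X)) == y))
```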

An important family of regularization schemes (5.1) are those for regression with a general loss function: take Y = ℝ and V(y, f(x)) = ψ(y − f(x)), where ψ : ℝ → ℝ_+ is even, convex and increasing on [0, +∞) with ψ(0) = 0. If |y| ≤ M almost surely with respect to ρ, then (V, ρ) is M-admissible. Our approach can be used to analyze the convergence of ∫_Z V(y, f^V_z(x)) dρ to inf_{f∈H} ∫_Z V(y, f(x)) dρ.

Example 4. Let ε > 0. The ε-insensitive norm is the univariate loss function ψ used for regression, defined [35] as ψ(t) = max{|t| − ε, 0}. It would be interesting to analyze the convergence of the scheme (5.1) as ε tends to zero.

For the classification algorithm (1.6), some of our error bounds can be extended to non-classifying loss functions (such as the exponential loss), i.e., those activating loss functions whose infimum cannot be achieved. For this purpose, we need a more general projection operator.

Definition 8. For M > 0, the projection operator at level M is defined on the space of measurable functions f : X → ℝ̄ as

    π_M(f)(x) = M if f(x) > M;  π_M(f)(x) = −M if f(x) < −M;  π_M(f)(x) = f(x) if −M ≤ f(x) ≤ M.

Using this projection operator, we can obtain similar error decompositions by revising the regularization error and introducing a level M adapted to the behavior of the loss function (the convergence rate of φ(t) as t → −∞). Then some learning rates can be obtained, following our approach.

Acknowledgement. When the paper was being revised as requested, we learned that a kernel-searching method, leading to the regularization scheme (1.6), has recently been studied in [22]. The learnability of multi-kernel spaces associated with Gaussian kernels with flexible variances, i.e., Σ = (0, +∞) in Example 3, has also recently been verified in [39]. We thank the referees for their careful reading and constructive suggestions, which helped us improve the paper.

References

[1] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc. 68 (1950).
[2] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, Convexity, classification, and risk bounds, preprint.
[3] B. Blanchard, O. Bousquet and P. Massart, Statistical performance of support vector machines, preprint.
[4] B. Blanchard, G. Lugosi and N. Vayatis, On the rate of convergence of regularized boosting classifiers, J. Mach. Learning Res. 4 (2003).
[5] B. E. Boser, I. Guyon, and V. Vapnik, A training algorithm for optimal margin classifiers, in Proceedings of the Fifth Annual Workshop of Computational Learning Theory 5 (1992), Pittsburgh, ACM.
[6] O. Bousquet and A. Elisseeff, Stability and generalization, J. Mach. Learning Res. 2 (2002).
[7] L. Breiman, Arcing classifiers (discussion paper), Ann. Stat. 26 (1998).
[8] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, Choosing multiple parameters for support vector machines, Mach. Learning 46 (2002).
[9] D. R. Chen, Q. Wu, Y. Ying and D. X. Zhou, Support vector machine soft margin classifiers: error analysis, J. Mach. Learning Res. 5 (2004).
[10] C. Cortes and V. Vapnik, Support-vector networks, Mach. Learning 20 (1995).
[11] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press.
[12] F. Cucker and S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc. 39 (2001).
[13] F. Cucker and S. Smale, Best choices for regularization parameters in learning theory: On the bias-variance problem, Found. Comput. Math. 2 (2002).
[14] F. Cucker and D. X. Zhou, Learning Theory: An Approximation Theory Viewpoint, monograph manuscript in preparation for Cambridge University Press.


More information

1 Rademacher Complexity Bounds

1 Rademacher Complexity Bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #10 Scribe: Max Goer March 07, 2013 1 Radeacher Coplexity Bounds Recall the following theore fro last lecture: Theore 1. With probability

More information

CS Lecture 13. More Maximum Likelihood

CS Lecture 13. More Maximum Likelihood CS 6347 Lecture 13 More Maxiu Likelihood Recap Last tie: Introduction to axiu likelihood estiation MLE for Bayesian networks Optial CPTs correspond to epirical counts Today: MLE for CRFs 2 Maxiu Likelihood

More information

e-companion ONLY AVAILABLE IN ELECTRONIC FORM

e-companion ONLY AVAILABLE IN ELECTRONIC FORM OPERATIONS RESEARCH doi 10.1287/opre.1070.0427ec pp. ec1 ec5 e-copanion ONLY AVAILABLE IN ELECTRONIC FORM infors 07 INFORMS Electronic Copanion A Learning Approach for Interactive Marketing to a Custoer

More information

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011)

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011) E0 370 Statistical Learning Theory Lecture 5 Aug 5, 0 Covering Nubers, Pseudo-Diension, and Fat-Shattering Diension Lecturer: Shivani Agarwal Scribe: Shivani Agarwal Introduction So far we have seen how

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley ENSIAG 2 / osig 1 Second Seester 2012/2013 Lesson 20 2 ay 2013 Kernel ethods and Support Vector achines Contents Kernel Functions...2 Quadratic

More information

Support Vector Machines MIT Course Notes Cynthia Rudin

Support Vector Machines MIT Course Notes Cynthia Rudin Support Vector Machines MIT 5.097 Course Notes Cynthia Rudin Credit: Ng, Hastie, Tibshirani, Friedan Thanks: Şeyda Ertekin Let s start with soe intuition about argins. The argin of an exaple x i = distance

More information

A Smoothed Boosting Algorithm Using Probabilistic Output Codes

A Smoothed Boosting Algorithm Using Probabilistic Output Codes A Soothed Boosting Algorith Using Probabilistic Output Codes Rong Jin rongjin@cse.su.edu Dept. of Coputer Science and Engineering, Michigan State University, MI 48824, USA Jian Zhang jian.zhang@cs.cu.edu

More information

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS ISSN 1440-771X AUSTRALIA DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS An Iproved Method for Bandwidth Selection When Estiating ROC Curves Peter G Hall and Rob J Hyndan Working Paper 11/00 An iproved

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a ournal published by Elsevier. The attached copy is furnished to the author for internal non-coercial research and education use, including for instruction at the authors institution

More information

Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions Kernel Choice and Classifiability for RKHS Ebeddings of Probability Distributions Bharath K. Sriperubudur Departent of ECE UC San Diego, La Jolla, USA bharathsv@ucsd.edu Kenji Fukuizu The Institute of

More information

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning Boosting Mehryar Mohri Courant Institute and Google Research ohri@cis.nyu.edu Weak Learning Definition: concept class C is weakly PAC-learnable if there exists a (weak)

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

Manifold learning via Multi-Penalty Regularization

Manifold learning via Multi-Penalty Regularization Manifold learning via Multi-Penalty Regularization Abhishake Rastogi Departent of Matheatics Indian Institute of Technology Delhi New Delhi 006, India abhishekrastogi202@gail.co Abstract Manifold regularization

More information

Metric Entropy of Convex Hulls

Metric Entropy of Convex Hulls Metric Entropy of Convex Hulls Fuchang Gao University of Idaho Abstract Let T be a precopact subset of a Hilbert space. The etric entropy of the convex hull of T is estiated in ters of the etric entropy

More information

Ensemble Based on Data Envelopment Analysis

Ensemble Based on Data Envelopment Analysis Enseble Based on Data Envelopent Analysis So Young Sohn & Hong Choi Departent of Coputer Science & Industrial Systes Engineering, Yonsei University, Seoul, Korea Tel) 82-2-223-404, Fax) 82-2- 364-7807

More information

Rademacher Complexity Margin Bounds for Learning with a Large Number of Classes

Rademacher Complexity Margin Bounds for Learning with a Large Number of Classes Radeacher Coplexity Margin Bounds for Learning with a Large Nuber of Classes Vitaly Kuznetsov Courant Institute of Matheatical Sciences, 25 Mercer street, New York, NY, 002 Mehryar Mohri Courant Institute

More information

3.8 Three Types of Convergence

3.8 Three Types of Convergence 3.8 Three Types of Convergence 3.8 Three Types of Convergence 93 Suppose that we are given a sequence functions {f k } k N on a set X and another function f on X. What does it ean for f k to converge to

More information

Distributed Subgradient Methods for Multi-agent Optimization

Distributed Subgradient Methods for Multi-agent Optimization 1 Distributed Subgradient Methods for Multi-agent Optiization Angelia Nedić and Asuan Ozdaglar October 29, 2007 Abstract We study a distributed coputation odel for optiizing a su of convex objective functions

More information

1 Generalization bounds based on Rademacher complexity

1 Generalization bounds based on Rademacher complexity COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #0 Scribe: Suqi Liu March 07, 08 Last tie we started proving this very general result about how quickly the epirical average converges

More information

Statistical Properties of Large Margin Classifiers

Statistical Properties of Large Margin Classifiers Statistical Properties of Large Margin Classifiers Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mike Jordan, Jon McAuliffe, Ambuj Tewari. slides

More information

Boosting with log-loss

Boosting with log-loss Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the

More information

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Soft Coputing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Beverly Rivera 1,2, Irbis Gallegos 1, and Vladik Kreinovich 2 1 Regional Cyber and Energy Security Center RCES

More information

Feature Extraction Techniques

Feature Extraction Techniques Feature Extraction Techniques Unsupervised Learning II Feature Extraction Unsupervised ethods can also be used to find features which can be useful for categorization. There are unsupervised ethods that

More information

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion Suppleentary Material for Fast and Provable Algoriths for Spectrally Sparse Signal Reconstruction via Low-Ran Hanel Matrix Copletion Jian-Feng Cai Tianing Wang Ke Wei March 1, 017 Abstract We establish

More information

Understanding Machine Learning Solution Manual

Understanding Machine Learning Solution Manual Understanding Machine Learning Solution Manual Written by Alon Gonen Edited by Dana Rubinstein Noveber 17, 2014 2 Gentle Start 1. Given S = ((x i, y i )), define the ultivariate polynoial p S (x) = i []:y

More information

Soft-margin SVM can address linearly separable problems with outliers

Soft-margin SVM can address linearly separable problems with outliers Non-linear Support Vector Machines Non-linearly separable probles Hard-argin SVM can address linearly separable probles Soft-argin SVM can address linearly separable probles with outliers Non-linearly

More information

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION GUANGHUI LAN AND YI ZHOU Abstract. In this paper, we consider a class of finite-su convex optiization probles defined over a distributed

More information

Universal algorithms for learning theory Part II : piecewise polynomial functions

Universal algorithms for learning theory Part II : piecewise polynomial functions Universal algoriths for learning theory Part II : piecewise polynoial functions Peter Binev, Albert Cohen, Wolfgang Dahen, and Ronald DeVore Deceber 6, 2005 Abstract This paper is concerned with estiating

More information

Uniform Approximation and Bernstein Polynomials with Coefficients in the Unit Interval

Uniform Approximation and Bernstein Polynomials with Coefficients in the Unit Interval Unifor Approxiation and Bernstein Polynoials with Coefficients in the Unit Interval Weiang Qian and Marc D. Riedel Electrical and Coputer Engineering, University of Minnesota 200 Union St. S.E. Minneapolis,

More information

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices CS71 Randoness & Coputation Spring 018 Instructor: Alistair Sinclair Lecture 13: February 7 Disclaier: These notes have not been subjected to the usual scrutiny accorded to foral publications. They ay

More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

Tail estimates for norms of sums of log-concave random vectors

Tail estimates for norms of sums of log-concave random vectors Tail estiates for nors of sus of log-concave rando vectors Rados law Adaczak Rafa l Lata la Alexander E. Litvak Alain Pajor Nicole Toczak-Jaegerann Abstract We establish new tail estiates for order statistics

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

OPTIMIZATION in multi-agent networks has attracted

OPTIMIZATION in multi-agent networks has attracted Distributed constrained optiization and consensus in uncertain networks via proxial iniization Kostas Margellos, Alessandro Falsone, Sione Garatti and Maria Prandini arxiv:603.039v3 [ath.oc] 3 May 07 Abstract

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

On the Use of A Priori Information for Sparse Signal Approximations

On the Use of A Priori Information for Sparse Signal Approximations ITS TECHNICAL REPORT NO. 3/4 On the Use of A Priori Inforation for Sparse Signal Approxiations Oscar Divorra Escoda, Lorenzo Granai and Pierre Vandergheynst Signal Processing Institute ITS) Ecole Polytechnique

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

Keywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution

Keywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution Testing approxiate norality of an estiator using the estiated MSE and bias with an application to the shape paraeter of the generalized Pareto distribution J. Martin van Zyl Abstract In this work the norality

More information

Lecture 21. Interior Point Methods Setup and Algorithm

Lecture 21. Interior Point Methods Setup and Algorithm Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and

More information

Journal of Mathematical Analysis and Applications

Journal of Mathematical Analysis and Applications J Math Anal Appl 386 202 205 22 Contents lists available at ScienceDirect Journal of Matheatical Analysis and Applications wwwelsevierco/locate/jaa Sei-Supervised Learning with the help of Parzen Windows

More information

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space Journal of Machine Learning Research 3 (2003) 1333-1356 Subitted 5/02; Published 3/03 Grafting: Fast, Increental Feature Selection by Gradient Descent in Function Space Sion Perkins Space and Reote Sensing

More information

Lecture 9: Multi Kernel SVM

Lecture 9: Multi Kernel SVM Lecture 9: Multi Kernel SVM Stéphane Canu stephane.canu@litislab.eu Sao Paulo 204 April 6, 204 Roadap Tuning the kernel: MKL The ultiple kernel proble Sparse kernel achines for regression: SVR SipleMKL:

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung

More information

Tail Estimation of the Spectral Density under Fixed-Domain Asymptotics

Tail Estimation of the Spectral Density under Fixed-Domain Asymptotics Tail Estiation of the Spectral Density under Fixed-Doain Asyptotics Wei-Ying Wu, Chae Young Li and Yiin Xiao Wei-Ying Wu, Departent of Statistics & Probability Michigan State University, East Lansing,

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths

More information

arxiv: v3 [cs.lg] 7 Jan 2016

arxiv: v3 [cs.lg] 7 Jan 2016 Efficient and Parsionious Agnostic Active Learning Tzu-Kuo Huang Alekh Agarwal Daniel J. Hsu tkhuang@icrosoft.co alekha@icrosoft.co djhsu@cs.colubia.edu John Langford Robert E. Schapire jcl@icrosoft.co

More information

A Bernstein-Markov Theorem for Normed Spaces

A Bernstein-Markov Theorem for Normed Spaces A Bernstein-Markov Theore for Nored Spaces Lawrence A. Harris Departent of Matheatics, University of Kentucky Lexington, Kentucky 40506-0027 Abstract Let X and Y be real nored linear spaces and let φ :

More information

arxiv: v1 [cs.lg] 8 Jan 2019

arxiv: v1 [cs.lg] 8 Jan 2019 Data Masking with Privacy Guarantees Anh T. Pha Oregon State University phatheanhbka@gail.co Shalini Ghosh Sasung Research shalini.ghosh@gail.co Vinod Yegneswaran SRI international vinod@csl.sri.co arxiv:90.085v

More information

Max-Product Shepard Approximation Operators

Max-Product Shepard Approximation Operators Max-Product Shepard Approxiation Operators Barnabás Bede 1, Hajie Nobuhara 2, János Fodor 3, Kaoru Hirota 2 1 Departent of Mechanical and Syste Engineering, Bánki Donát Faculty of Mechanical Engineering,

More information

L p moments of random vectors via majorizing measures

L p moments of random vectors via majorizing measures L p oents of rando vectors via ajorizing easures Olivier Guédon, Mark Rudelson Abstract For a rando vector X in R n, we obtain bounds on the size of a saple, for which the epirical p-th oents of linear

More information

Research Article Robust ε-support Vector Regression

Research Article Robust ε-support Vector Regression Matheatical Probles in Engineering, Article ID 373571, 5 pages http://dx.doi.org/10.1155/2014/373571 Research Article Robust ε-support Vector Regression Yuan Lv and Zhong Gan School of Mechanical Engineering,

More information

1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago

More information

A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION

A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION A eshsize boosting algorith in kernel density estiation A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION C.C. Ishiekwene, S.M. Ogbonwan and J.E. Osewenkhae Departent of Matheatics, University

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

Learnability and Stability in the General Learning Setting

Learnability and Stability in the General Learning Setting Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu

More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions Tight Inforation-Theoretic Lower Bounds for Welfare Maxiization in Cobinatorial Auctions Vahab Mirrokni Jan Vondrák Theory Group, Microsoft Dept of Matheatics Research Princeton University Redond, WA 9805

More information

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison yströ Method vs : A Theoretical and Epirical Coparison Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou Machine Learning Lab, GE Global Research, San Raon, CA 94583 Michigan State University,

More information

A Theoretical Analysis of a Warm Start Technique

A Theoretical Analysis of a Warm Start Technique A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

Generalized eigenfunctions and a Borel Theorem on the Sierpinski Gasket.

Generalized eigenfunctions and a Borel Theorem on the Sierpinski Gasket. Generalized eigenfunctions and a Borel Theore on the Sierpinski Gasket. Kasso A. Okoudjou, Luke G. Rogers, and Robert S. Strichartz May 26, 2006 1 Introduction There is a well developed theory (see [5,

More information

VC Dimension and Sauer s Lemma

VC Dimension and Sauer s Lemma CMSC 35900 (Spring 2008) Learning Theory Lecture: VC Diension and Sauer s Lea Instructors: Sha Kakade and Abuj Tewari Radeacher Averages and Growth Function Theore Let F be a class of ±-valued functions

More information

Bipartite subgraphs and the smallest eigenvalue

Bipartite subgraphs and the smallest eigenvalue Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

Convex Programming for Scheduling Unrelated Parallel Machines

Convex Programming for Scheduling Unrelated Parallel Machines Convex Prograing for Scheduling Unrelated Parallel Machines Yossi Azar Air Epstein Abstract We consider the classical proble of scheduling parallel unrelated achines. Each job is to be processed by exactly

More information

A Theoretical Framework for Deep Transfer Learning

A Theoretical Framework for Deep Transfer Learning A Theoretical Fraewor for Deep Transfer Learning Toer Galanti The School of Coputer Science Tel Aviv University toer22g@gail.co Lior Wolf The School of Coputer Science Tel Aviv University wolf@cs.tau.ac.il

More information

3.3 Variational Characterization of Singular Values

3.3 Variational Characterization of Singular Values 3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and

More information

Fairness via priority scheduling

Fairness via priority scheduling Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation

More information

Constrained Consensus and Optimization in Multi-Agent Networks arxiv: v2 [math.oc] 17 Dec 2008

Constrained Consensus and Optimization in Multi-Agent Networks arxiv: v2 [math.oc] 17 Dec 2008 LIDS Report 2779 1 Constrained Consensus and Optiization in Multi-Agent Networks arxiv:0802.3922v2 [ath.oc] 17 Dec 2008 Angelia Nedić, Asuan Ozdaglar, and Pablo A. Parrilo February 15, 2013 Abstract We

More information

Testing Properties of Collections of Distributions

Testing Properties of Collections of Distributions Testing Properties of Collections of Distributions Reut Levi Dana Ron Ronitt Rubinfeld April 9, 0 Abstract We propose a fraework for studying property testing of collections of distributions, where the

More information

Support recovery in compressed sensing: An estimation theoretic approach

Support recovery in compressed sensing: An estimation theoretic approach Support recovery in copressed sensing: An estiation theoretic approach Ain Karbasi, Ali Horati, Soheil Mohajer, Martin Vetterli School of Coputer and Counication Sciences École Polytechnique Fédérale de

More information